Unit I Introduction
1. Data: The values recorded in an experiment or observation are called data.
1.1. Types of Data:
1.1.1. Primary Data: The data collected by an investigator is called primary data. It is first-hand information.
1.1.2. Secondary Data: The data collected from another source is called secondary data. E.g. data collected from newspapers, journals etc.
2. Biological Data: Biological data are data or measurements collected from biological sources,
which are often stored or exchanged in a digital form.
E.g. DNA base-pair sequences, and population data used in ecology.
2.1. Data Measurement Scale: There are four data measurement scales: nominal, ordinal, interval and ratio.
2.1.1. Nominal Scale: Nominal scales are used for labeling variables, without any quantitative value. “Nominal” scales could simply be called “labels”, e.g. gender, eye colour, species.
Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called “dichotomous.”
2.1.2. Ordinal Scale: Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc., where the categories have a meaningful order.
2.2. Types of Biological Data:
2.2.1. Continuous Data: Data that can take any value between two specified values are called continuous data.
 It is not countable but measurable.
 Values can be integers or decimals.
 It is quantitative data.
 It has an infinite or unlimited number of values.
E.g. height of students, weight of students etc.
2.2.2. Discrete Data: The values are countable.
 Data is in the form of integers (whole numbers), not decimals.
 It is quantitative.
 It has a finite number of values.
 It is discontinuous.
E.g. number of absentees in a class.
3. Graphical Distribution: Presenting data in the form of graphs is called graphical presentation of data.
3.1. Graph:
 A graph is the geometric image of data.
 A graph is a diagram presenting statistical data by means of lines.
 A graph has two intersecting lines called axes.
 The horizontal line is called the X-axis and the vertical line is called the Y-axis.
3.2. Frequency Distribution Graphs: Graphs obtained by plotting grouped data are called
frequency distribution graphs.
3.2.1. Histogram: Histogram is a graph containing frequencies in the form of vertical rectangles.
It is an area diagram.
 It is a graphical presentation of frequency distribution.
 The X – axis is marked with class intervals.
 The Y – axis is marked with frequencies.
 Vertical rectangles are drawn as per the height of the frequency of each class. Rectangles
are drawn without any gap in between.
 Histogram is a two dimensional diagram.
Fig: Example of a histogram
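Such a histogram can be produced programmatically; below is a minimal matplotlib sketch, where the height values are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical student heights in cm (not from the notes).
heights = [150, 152, 155, 156, 158, 160, 161, 161, 163, 165, 166, 168, 170, 172, 175]

# bins sets the class intervals on the X axis; frequencies go on the Y axis.
# Adjacent rectangles are drawn without gaps, as described above.
plt.hist(heights, bins=5, edgecolor="black")
plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.show()
```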
3.2.2. Bar Graph:
 A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.
 The bars can be plotted vertically or horizontally.
 A vertical bar chart is sometimes called a column graph.
 A bar graph shows comparisons among discrete categories.
Fig: Example of a Bar Graph
3.2.3. Box Plot:
 In descriptive statistics, a box plot is a method for graphically depicting groups of
numerical data through their quartiles.
 Box plots may also have lines extending vertically from the boxes (whiskers) indicating
variability outside the upper and lower quartiles, hence the terms box-and-whisker
plot and box-and-whisker diagram.
 Outliers may be plotted as individual points.
 Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.
Fig: Example of a Box Plot
3.2.4. Frequency Polygon:
 A frequency polygon is obtained from a histogram by joining the midpoints of the tops of the rectangles with straight lines.
 Polygon means a figure with many angles.
 It is an area diagram. Polygon is a graph. It is the graphical representation of
frequency distribution.
 The X – axis is marked with class intervals.
 The Y – axis is marked with frequencies.
 The mid points of the top of the rectangles are joined by straight lines.
3.2.4.1. Uses of Frequency Polygon:
 It simplifies complex data.
 It gives an idea of the pattern of distribution of variables in the population.
 It facilitates comparison of two or more frequency distributions on the same graph.
 It gives a clear picture of the data.
Fig: Example of a Frequency Polygon
3.3. Cumulative Frequency Distribution: The cumulative frequency distribution is a statistical table in which the frequency of each class is added to the frequencies of the preceding classes. For example:
Class       Frequency
0 – 9           3
10 – 19         9
20 – 29        11
30 – 39         7
Total Frequency: 30
Table: Continuous Frequency Distribution

Class       Frequency    Cumulative Frequency
0 – 9           3                 3
10 – 19         9                12
20 – 29        11                23
30 – 39         7                30
Table: Cumulative Frequency Distribution
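The cumulative column can be reproduced with a short Python loop; a minimal sketch using the table above:

```python
# Frequencies from the table above.
classes = ["0-9", "10-19", "20-29", "30-39"]
frequencies = [3, 9, 11, 7]

cumulative = []
running_total = 0
for f in frequencies:
    running_total += f           # add each class frequency to the running total
    cumulative.append(running_total)

for c, f, cf in zip(classes, frequencies, cumulative):
    print(c, f, cf)
# The final cumulative frequency (30) equals the total frequency.
```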
4. Population:
 In biology, a population is all the organisms of the same group or species, which live in
a particular geographical area, and have the capability of interbreeding.
 The area of a sexual population is the area where inter-breeding is potentially possible
between any pair within the area, and where the probability of interbreeding is greater
than the probability of cross-breeding with individuals from other areas.
Fig: The distribution of human world population in 1994
5. Sampling:
 Sampling is a method of collection of data.
 Sample is a representative fraction of a population.
 When the population is very large or infinite, sampling is the suitable method for data
collection.
 Example: The Oxygen content of pond water can be found by titrating just 100 ml of
water.
 There are two types of sampling, namely
1. Random Sampling.
2. Non-random Sampling.
5.1. Random Sampling:
 In random sampling a small group is selected from a large population without any aim or
predetermination. The small group is called sample.
 In this method each item of population has an equal and independent chance of being
included in the sample.
 The random sample is selected by lottery method.
5.1.1. Simple Random Sampling:
 In this method a sample is selected by which each item of the population has an equal and
independent chance of being included in the sample.
 In this method, a certain number of items are chosen at random without any pre-determined basis.
5.1.2. Stratified Random Sampling:
 This sampling technique is generally recommended when the population is
heterogeneous.
 In this method, the whole population is divided into strata or sub-groups possessing similar characteristics.
 Samples are selected taking equal proportion of items from each group.
 Example: We want to select 100 students from a population of 1000 students, consisting of 700 girls and 300 boys. The whole population is divided into two strata: 700 girls and 300 boys. Then, by simple random sampling, 70 girls and 30 boys are selected to get a sample of 100 students.
5.1.3. Systematic Random Sampling:
 It is also known as Quasi Random Sampling.
 In this method, all the items are arranged in some spatial or temporal order.
 Example: Persons listed alphabetically in a telephone directory, plants growing in rows
in field.
Unit II Descriptive Statistics
1. Measures of Central Tendency:
 A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
 Measures of central tendency are sometimes called measures of central location.
 They are also classed as summary statistics.
1.1. Mean (Arithmetic):
The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = Σx / n

1.1.1. Significance:
 One of its important properties is that it minimizes error in the prediction of any one value in your data set. That is, it is the value that produces the lowest amount of error from all other values in the data set.
 An important property of the mean is that it includes every value in your data set as part of the calculation.
 In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
1.2. Median:
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when there is an odd number of scores; in the case of an even number of scores (for example 10 scores), we simply take the middle two scores and average the result. Example:
65 55 89 56 35 14 56 55 87 45
We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Now we take the 5th and 6th scores in our data set (55 and 56) and average them to get a median of 55.5.
1.3. Mode:
The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. Normally, the mode is used for categorical data where we wish to know which is the most common category.
To find the mode of ungrouped data, the values are arranged in ascending order. The value which occurs the maximum number of times is the mode.
18 21 23 23 25 25 25 27 29 29
In the above data 25 occurs maximum number of times. So 25 is the mode.
However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency.
Mode is a positional average. It is a measure of central value. When data has one concentration of frequency, it is called unimodal. When it has more than one concentration, it is called bimodal (for 2 concentrations), trimodal (for 3 concentrations), and so on.
1.3.1. Significance:
 No mathematical calculation is needed.
 The mode can be found out easily.
 However, it is not very reliable.
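Python's statistics module computes all three measures directly; a minimal sketch using the scores from the median example above:

```python
import statistics

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

print(statistics.mean(scores))       # 59, the arithmetic mean
print(statistics.median(scores))     # 56, the middle value of the sorted data
print(statistics.multimode(scores))  # [55, 56]: both occur twice, so the data are bimodal
```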
1.4. Range:
 Range is the difference between the highest value and the lowest value of a set of data.
 Range = Largest value (Xm) – Smallest value (X0)
1.4.1. Coefficient of Range:
This is a relative measure of dispersion and is based on the value of the range. It is also called the range coefficient of dispersion. It is defined as:
Coefficient of Range = (Xm – X0) / (Xm + X0)
2.1. Variance: Variance is the average of the squared differences from the mean. Steps involved in calculating the variance:
i. Calculate the mean.
ii. Subtract the mean from each value.
iii. Square the result.
iv. Add the squared numbers.
v. Take the average of the squared results.
2.2. Standard Deviation:
 Standard deviation is a measure of dispersion.
 The standard deviation is a measure of how spread out numbers are.
 Its symbol is SD or σ (the Greek letter sigma).
 The formula: the standard deviation is the square root of the variance, σ = √(variance).
Example:
The heights of the dogs (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
The first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300)/ 5
= 1970/5
= 394
To calculate the Variance, take each difference from the mean, square it, and then average the result:
Variance:
σ² = [206² + 76² + (−224)² + 36² + (−94)²] / 5
= (42436 + 5776 + 50176 + 1296 + 8836) / 5
= 108520 / 5
= 21704
So the Variance is 21,704
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation (σ) = √21704
= 147.32...
= 147 (to the nearest mm)
Therefore, the square of the standard deviation is the variance.
2.3. Coefficient of Variation (CV): The coefficient of variation is the standard deviation expressed as a percentage of the mean. It is a relative measure of dispersion.
Coefficient of Variation = (SD / X̄) × 100
Where,
SD = Standard Deviation, X̄ = Mean
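The dog-height example above can be verified in a few lines of Python; a minimal sketch (population variance, dividing by n as in the worked example):

```python
import math

heights = [600, 470, 170, 430, 300]  # mm

mean = sum(heights) / len(heights)                               # 394
variance = sum((h - mean) ** 2 for h in heights) / len(heights)  # 21704
sd = math.sqrt(variance)                                         # ~147.32
cv = sd / mean * 100                                             # CV as a percentage of the mean

print(mean, variance, round(sd, 2), round(cv, 1))
```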
2.4. Grouped Data:
 Grouped data is data that has been organized into groups known as classes.
 Grouped data has been 'classified' and thus some level of data analysis has taken place,
which means that the data is no longer raw.
 A data class is a group of data which is related by some user-defined property. For example, while collecting the ages of people we could group them into classes such as those in their teens, twenties, thirties, forties and so on. Each of those groups is called a class.
 Each of those classes is of a certain width and this is referred to as the Class
Interval or Class Size.
 This class interval is very important when it comes to drawing Histograms and Frequency
diagrams. All the classes may have the same or different class size.
Below is an example of grouped data where the classes have the same class interval.
Age (years) Frequency
0 - 9 12
10 - 19 30
20 - 29 18
30 - 39 12
40 - 49 9
50 - 59 6
60 - 69 0
Below is an example of grouped data where the classes have different class interval.
Age (years) Frequency Class Interval
0 - 9 15 10
10 - 19 18 10
20 - 29 17 10
30 - 49 35 20
50 - 79 20 30
2.5. Graphical Methods: These methods are applied to visually describe data from a sample or population. Graphs provide visual summaries of data that describe essential information more quickly and completely than tables of numbers.
There are many types of graphical representation:
2.5.1. The Bar Chart: To Construct a Bar Chart,
 Place categories on the horizontal axis,
 Then place frequency (or relative frequency) on the vertical axis.
 After that construct vertical bars of equal width, one for each category.
 Its height is proportional to the frequency (or relative frequency) of the
category.
Fig: Example of a Bar Chart
2.5.2. The Pie Chart: For drawing a pie chart,
 Make a complete circle that represents the total number of measurements. Partition it into slices, one for each category.
 The size of a slice is proportional to the relative frequency of that category.
 Determine the angle of each slice by multiplying the relative frequency by 360 degrees.
Fig: Example of a Pie Chart, Use of different Web Browser
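Both charts can be drawn with matplotlib; a minimal sketch, where the browser counts are made up to echo the figure above:

```python
import matplotlib.pyplot as plt

# Hypothetical web-browser usage counts (illustrative only).
browsers = ["Chrome", "Firefox", "Edge", "Other"]
counts = [60, 20, 12, 8]
rel_freq = [c / sum(counts) for c in counts]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Bar chart: categories on the horizontal axis, frequency on the vertical axis.
ax1.bar(browsers, counts)
ax1.set_ylabel("Frequency")

# Pie chart: each slice's angle is the relative frequency times 360 degrees.
ax2.pie(rel_freq, labels=browsers, autopct="%1.0f%%")

plt.show()
```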
2.5.3. Histogram: Histogram is a graph containing frequencies in the form of vertical rectangles.
It is an area diagram.
 It is a graphical presentation of frequency distribution.
 The X – axis is marked with class intervals.
 The Y – axis is marked with frequencies.
 Vertical rectangles are drawn as per the height of the frequency of each class. Rectangles
are drawn without any gap in between.
 Histogram is a two dimensional diagram.
Fig: Example of a histogram
2.5.4. Quantile Plots: These visually portray the quantiles, or percentiles (which equal the quantiles times 100), of the distribution of sample data. Quantiles of importance, such as the median, are easily discerned (quantile, or cumulative frequency = 0.5). The main benefits of quantile plots are as follows:
i. Arbitrary categories are not required, as with histograms or stem-and-leaf plots (S-L's).
ii. All of the data are displayed, unlike a box plot.
iii. Every point has a distinct position, without overlap.
Fig: Example of a Quantile Plot
2.5.5. Box Plot:
 In descriptive statistics, a box plot is a method for graphically depicting groups of
numerical data through their quartiles.
 Box plots may also have lines extending vertically from the boxes (whiskers) indicating
variability outside the upper and lower quartiles, hence the terms box-and-whisker
plot and box-and-whisker diagram.
 Outliers may be plotted as individual points.
 Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.
Fig: Example of a Box Plot
2.5.6. Benefits of Graphical representation:
1. Acceptability: A graphical report is acceptable to people who have a busy schedule because it easily highlights the theme of the report. This helps to avoid wasting time.
2. Comparative Analysis: Information can be compared in terms of graphical
representation. Such comparative analysis helps for quick understanding and attention.
3. Less Cost: Descriptive information takes a long time to present properly and more money to print, but a graphical presentation can be made short but catchy while keeping the report understandable. It obviously involves less cost.
4. Decision Making: Business executives can view the graphs at a glance and can make
decision very quickly which is hardly possible through descriptive report.
5. Logical Ideas: If tables, design and graphs are used to represent information then a
logical sequence is created to clear the idea of the audience.
6. Helpful for less educated Audience: Less literate or illiterate people can understand
graphical representation easily because it does not involve going through line by line of
any descriptive report.
7. Less Effort and Time: Presenting any table, design, image or graph requires less effort and time. Furthermore, such presentation allows quick understanding of the information.
8. Less Error and Mistakes: Qualitative or informative or descriptive reports involve
errors or mistakes. As graphical representations are exhibited through numerical figures,
tables or graphs, it usually involves less error and mistake.
9. A complete Idea: Such representation creates clear and complete idea in the mind of
audience. Reading hundred pages may not give any scope to make decision. But an
instant view or looking at a glance obviously makes an impression in the mind of
audience regarding the topic or subject.
10. Use on the Notice Board: Such a representation can be hung on the notice board to quickly draw the attention of employees in any organization.
2.5.7. Drawbacks of Graphical Representation:
1. Expensive: Graphical representations of reports are costly because they involve images, colors and paints. The combination of materials and human effort makes graphical presentation expensive.
2. More Time: Graphical representation involves more time, as it requires graphs and figures which take more time to prepare.
3. Errors and Mistakes: Since graphical representations are complex, there is a chance of errors and mistakes. This causes problems for general people's understanding.
4. Lack of Privacy: Graphical representation makes a full presentation of the information, which may hamper the objective of keeping something secret.
5. Problems in Selecting the Appropriate Method: Information can be presented through various graphical methods and ways. It is very hard to select the most suitable method.
6. Problem of Understanding: Not all people can understand the meaning of a graphical representation, because it involves various technical matters which are complex to general people.
2.6. Obtaining Descriptive Statistics on Computer (MS Excel):
Suppose we have the scores of 14 participants for a test. To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: can't find the Data Analysis button? Load the Analysis ToolPak add-in.
2. Select Descriptive Statistics and click OK.
3. Select the range A2:A15 as the Input Range.
4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.
6. Click OK.
Result:
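The same summary can also be generated outside Excel; a minimal pandas sketch (the 14 scores here are hypothetical, since the worksheet values are not given in the notes):

```python
import pandas as pd

# Hypothetical scores of 14 participants.
scores = pd.Series([88, 72, 95, 60, 79, 83, 91, 68, 75, 80, 85, 70, 90, 77], name="Score")

# describe() reports count, mean, std, min, quartiles and max,
# much like Excel's Descriptive Statistics output.
print(scores.describe())
```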
3. Case Study:
In the social sciences and life sciences, a case study is a research method involving an up-close, in-depth, and detailed examination of a subject of study (the case), as well as its related contextual conditions.
Case studies can be produced by following a formal research method. These case studies are likely to appear in formal research venues, such as journals and professional conferences, rather than popular works. The resulting body of 'case study research' has long had a prominent place in many disciplines and professions, ranging from psychology, anthropology, sociology, and political science to education, clinical science, social work, and administrative science.
3.1. Types of Case Studies:
Under the more generalized category of case study exist several subdivisions, each of which is custom selected for use depending upon the goals of the investigator. These types of case study include the following:
 Illustrative case studies: These are primarily descriptive studies. They typically utilize one or two instances of an event to show the existing situation. Illustrative case studies serve primarily to make the unfamiliar familiar and to give readers a common language about the topic in question.
 Exploratory (or pilot) case studies: These are condensed case studies performed before implementing a large-scale investigation. Their basic function is to help identify questions and select types of measurement prior to the main investigation. The primary pitfall of this type of study is that initial findings may seem convincing enough to be released prematurely as conclusions.
 Cumulative case studies: These serve to aggregate information from several sites collected
at different times. The idea behind these studies is that the collection of past studies will
allow for greater generalization without additional cost or time being expended on new,
possibly repetitive studies.
 Critical instance case studies: These examine one or more sites either for the purpose of
examining a situation of unique interest with little to no interest in generalization, or to call
into question a highly generalized or universal assertion. This method is useful for answering
cause and effect questions.
Unit III Probability and Distribution:
1. Probability: Probability is the proportion of times an event occurs in a set of trials. The word
‘probability’ means chance; likely to happen.
Probability is calculated by following formula:
P = e / t
Where,
P = Probability
e = number of times an event occurs or frequency
t = total number of trials or items
The probability value is always a fraction falling between 0 and 1.
Example:
When a die numbered 1 to 6 is tossed, the total number of possible outcomes is 6. The probability of getting any one number is 1/6 ≈ 0.17.
p is the probability of an event occurring; q is the probability of the event not occurring, so p + q = 1.
When p is known, q can be calculated, and vice versa:
q = 1 – p
p = 1 – q
1.1. Laws of Probability: There are two theorems of probability, namely
1. Addition theorem. 2. Multiplication theorem.
1.1.1. Addition Theorem:
 The probability of the occurrence of a mutually exclusive event is the sum of the
individual probabilities of the events.
 Mutually exclusive events cannot occur simultaneously.
 The occurrence of the one event prevents the occurrence of the other events.
 Example: In coin tossing experiment, the occurrence of head excludes the occurrence of
tail.
If the probability of a head is p(A) and that of a tail is p(B), then:
Probability of head or tail = p(A or B) = p(A) + p(B) = 1/2 + 1/2 = 1
1.1.2. Multiplication Theorem:
 The probability of the occurrence of two independent events is the product of their
individual probabilities.
 For independent event, the probability is calculated by multiplication.
 The independent event will not affect the occurrence of other events. When two coins are
tossed, the result of the first coin does not affect the second coin.
 Example: For two independent events A and B with probabilities P(A) and P(B):
Probability that both occur = P(A and B) = P(A) × P(B)
E.g. the probability of getting heads on both of two tossed coins is 1/2 × 1/2 = 1/4.
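Both theorems can be checked with exact fractions; a minimal sketch for the coin examples above:

```python
from fractions import Fraction

p_head = Fraction(1, 2)
p_tail = Fraction(1, 2)

# Addition theorem (mutually exclusive events): P(head or tail) = P(head) + P(tail)
print(p_head + p_tail)   # 1

# Multiplication theorem (independent events): P(head on both of two coins)
print(p_head * p_head)   # 1/4
```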
2. Random Events: Experiments whose outcome cannot be predicted in advance are called random experiments. For example, when we toss a coin, we do not know whether it will land heads up or tails up. Hence tossing a coin is a random experiment. Another example is the result of an interview or examination.
When we speak about random experiments, we have to know what the sample space is.
Sample space denoted by S is the set of all possible outcomes of a random experiment.
Example: consider the random experiment of tossing a die. Let us write down the sample space
S here.
The sample space is all the possible outcomes here. What are the possible outcomes when we toss a die once? As we know, a die has 6 faces numbered 1, 2, 3, 4, 5, 6. When we toss it once, only one of the faces will turn up. Hence the sample space is
S= {1,2,3,4,5,6}
Consider one more simple example of tossing two coins. Let us write down the sample space
here.
S= {(H,H),(H,T),(T,H),(T,T)}
Here,
H: head; T: tail
3. Exhaustive Events: Two or more events are said to be exhaustive if there is a certain chance of occurrence of at least one of them when they are all considered together. An exhaustive event can be either elementary or compound.
Example: Consider the experiment of a fair die being thrown. There are six possible outcomes, all equally likely to occur. The events of getting the different numbers, taken together, are exhaustive, since at least one of them is certain to happen: whatever is thrown, we are sure to get one of the numbers during the experiment. So the events are exhaustive.
4. Mutually Exclusive Event: Mutually exclusive events cannot occur together simultaneously.
The occurrence of one event prevents the occurrence of the other event. The mutually exclusive
events are connected by the words ‘either or’.
Example: Head and tail of a coin.
5. Equally Likely Events: Equally likely events have equal chances of occurrence.
Example: Winning or losing in a game. Head or tail of a coin.
6. Binomial Distribution: A binomial distribution can be thought of as simply the probability
of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times.
The binomial is a type of distribution that has two possible outcomes (the prefix “bi” means
two, or twice).
Examples of binomial distribution problems:
 The number of defective/non-defective products in a production run.
 Yes/No Survey (such as asking 150 people if they watch ABC news).
 Vote counts for a candidate in an election.
 The number of successful sales calls.
 The number of male/female workers in a company
6.1. Criteria of Binomial distributions:
i. The number of observations or trials is fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times. This is common sense: if you toss a coin once, your probability of getting a tail is 50%. If you toss a coin 20 times, your probability of getting at least one tail is very, very close to 100%.
ii. Each observation or trial is independent. In other words, none of your trials have an
effect on the probability of the next trial.
iii. The probability of success (tails, heads, fail or pass) is exactly the same from one trial
to another.
6.2. Formula:
The binomial probability mass function is:

P(x) = nCx · p^x · q^(n−x)

Where:
 p is the probability of success on any trial.
 q = 1 − p is the probability of failure.
 n is the number of trials/experiments.
 x is the number of successes; it can take the values 0, 1, 2, 3, ..., n.
 nCx = n!/(x!(n−x)!) denotes the number of combinations of n elements taken x at a time.
Knowing what nCx means, we can also write the formula as:

P(x) = [n!/(x!(n−x)!)] · p^x · q^(n−x)
Problem: A box of candies has many different colors in it. There is a 15% chance of getting a
pink candy. What is the probability that exactly 4 candies in a box are pink out of 10?
We have:
n = 10, p = 0.15, q = 0.85, x = 4
Substituting into the formula:

P(4) = 10C4 × (0.15)^4 × (0.85)^6 = 210 × 0.00050625 × 0.37715 ≈ 0.04

Interpretation: The probability that exactly 4 candies in a box are pink is about 0.04.
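The calculation can be reproduced with Python's math.comb; a minimal sketch of the mass formula applied to the candy problem:

```python
from math import comb

def binomial_pmf(n: int, x: int, p: float) -> float:
    """P(x) = nCx * p^x * q^(n-x)"""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

print(round(binomial_pmf(10, 4, 0.15), 4))  # ~0.0401, i.e. about 0.04
```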
6.3. Properties of binomial distribution:
1. Binomial distribution is applicable when the trials are independent and each trial has just
two outcomes success and failure. It is applied in coin tossing experiments, sampling
inspection plan, genetic experiments and so on.
2. Binomial distribution is known as bi-parametric distribution as it is characterized by two
parameters n and p. This means that if the values of n and p are known, then the distribution is
known completely.
3. The mean of the binomial distribution is given by
μ = np
4. Depending on the values of the two parameters, binomial distribution may be uni-modal or bi-
modal.
To know the mode of a binomial distribution, first find the value of (n+1)p.
If (n+1)p is not an integer, the distribution is uni-modal, and the mode is the largest integer contained in (n+1)p.
If (n+1)p is an integer, the distribution is bi-modal, and the two modes are (n+1)p and (n+1)p − 1.
5. The variance of the binomial distribution is given by
σ² = npq
6. Since p and q are numerically less than or equal to 1,
npq < np
That is, variance of a binomial variable is always less than its mean.
7. Variance of binomial variable X attains its maximum value at p = q = 0.5 and this maximum
value is n/4.
8. Additive property of binomial distribution.
Let X and Y be two independent binomial variables, where X has parameters n₁ and p, and Y has parameters n₂ and p. Then (X + Y) will also be a binomial variable, with parameters (n₁ + n₂) and p.
7. Poisson Distribution:
 Poisson distribution was devised by Poisson in 1837.
 It is a discrete frequency distribution.
 Poisson distribution describes the occurrence of rare events. Hence it is called the law of improbable events.
 When the probability of the event is very rare in a large number of trials, the resulting
distribution is called Poisson distribution.
 Example: Number of death due to heart attack in a hospital or a town.
7.1. Properties of Poisson Distribution:
 The probability of the success of the event (p) is very small and approaches zero.
 The probability of the failure of the event (q) is very high and almost equal to 1 and n is
also large.
 Poisson distribution has a single parameter called mean denoted by m.
m = np = constant.
 The formula used for the Poisson distribution is as follows:
Probability of r successes: P(r) = (e^(−m) · m^r) / r!
Where,
P = probability
r = 0, 1, 2, 3, ..., n successes
e = 2.7183 (constant)
m = mean of the distribution
 SD (Standard Deviation) of Poisson distribution = √m = √(np)
 Variance = SD² = m = np
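The formula is easy to evaluate directly; a minimal sketch, where the mean m = 2 rare events per period is a hypothetical value for illustration:

```python
from math import exp, factorial

def poisson_pmf(r: int, m: float) -> float:
    """P(r) = e^(-m) * m^r / r!"""
    return exp(-m) * m**r / factorial(r)

# Probability of exactly 3 rare events when the mean is 2 (hypothetical m).
print(round(poisson_pmf(3, 2.0), 4))  # ~0.1804
```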
8. Normal Distribution: Normal distribution is a continuous probability distribution. In this
distribution the values are clustered closely around the centre and the values decrease towards
the left and right.
Example: The height of students in a class is a typical example of normal distribution. The height of most students will be between 150 cm and 170 cm. The height of only a few students will be less than 150 cm, and the height of only a few students will be above 170 cm. Thus there is an increasing number towards the middle point and a decreasing number towards the ends.
8.1. Properties of Normal Distribution:
 The graph obtained for normal distribution is called the normal distribution curve.
 The normal distribution curve is obtained when the values are given on the X axis and the number of individuals (frequency) on the Y axis.
 The normal distribution curve is symmetrical. It is bell shaped.
Fig: Example of a Normal Distribution Curve
 The normal distribution curve is also called the Gaussian curve, named after its discoverer Carl Gauss.
 The normal distribution curve is a continuous distribution. It is associated with height, weight, age, rate of respiration etc.
 It has only one maximum peak. Hence it is a unimodal curve.
 The height of the normal curve is maximum at its mean.
 Mean, median and mode are equal for a normal distribution: Mean = Median = Mode.
 Most of the values are clustered around the mean and there are relatively a few observations at the extreme ends.
 The normal curve never touches the horizontal axis.
 The mean deviation is about 4/5 of the standard deviation.
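These properties can be illustrated by simulation; a minimal sketch drawing heights from a normal distribution whose mean (160 cm) and SD (5 cm) are assumed values echoing the example above:

```python
import random
import statistics

random.seed(1)
heights = [random.gauss(160, 5) for _ in range(10_000)]  # simulated student heights

print(round(statistics.mean(heights), 1))    # close to 160
print(round(statistics.median(heights), 1))  # nearly equal to the mean (symmetry)
print(round(statistics.stdev(heights), 1))   # close to 5
```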
Unit IV: Correlation and Regression Analysis
1. Correlation:
 Correlation, in the finance and investment industries, is a statistic that measures the
degree to which two securities move in relation to each other.
 Correlations are used in advanced portfolio management, computed as the correlation
coefficient, which has a value that must fall between -1.0 and +1.0.
 Correlation is a statistic that measures the degree to which two variables move in relation
to each other.
1.1. Definition of Correlation:
 According to Taro Yamane, “Correlation analysis is a discussion of the degree of
closeness of the relationship between two variables.”
 According to Ya Lun Chou, “Correlation analysis attempts to determine the degree of
relationship between variables.”
 According to Prof. Bodding, “Wherever some definite connection exists between 2 or
more groups, classes or series of data, there is said to be a correlation.”
 A very simple definition is given by A. M. Tuttle, “An analysis of the co-variation of two
or more variables is usually called correlation.”
1.2. The Formula for Correlation: The correlation coefficient is the covariance of the two variables divided by the product of their standard deviations:

r = Cov(X, Y) / (σX · σY)

Correlation measures association, but it does not tell you whether x causes y or vice versa, or whether the association is caused by some third (perhaps unseen) factor.
1.3. Positive Correlation: A perfect positive correlation means that the correlation coefficient is
exactly 1. This implies that as one security moves, either up or down, the other security moves in
lockstep, in the same direction.
1.4. Negative Correlation: A perfect negative correlation means that two assets move in
opposite directions, while a zero correlation implies no relationship at all.
1.5. Calculation of Correlation: (Karl Pearson’s Coefficient of Correlation)
Karl Pearson, a great biometrician and statistician, suggested a mathematical method for
measuring the magnitude of linear relationship between two variables.
Karl Pearson’s method is the most widely used method in practice and is known as the Pearsonian coefficient of correlation. It is denoted by the symbol “r”. The simplest formula is:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

The value of the coefficient of correlation shall always lie between +1 and −1. When r = +1, there is a perfect positive correlation between the two variables. When r = −1, there is perfect negative correlation between the two variables. When r = 0, there is no relationship or correlation between the two variables. Theoretically, we get values which lie between +1 and −1; but normally the value lies between +0.8 and −0.5.
1.6. Problem: Find the coefficient of correlation between the age of husbands (X) and the age of
wives (Y).
X 23 27 28 28 29 30 31 33 35 36
Y 18 20 22 27 21 29 27 29 28 29
Solution:
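Since the worked solution is not reproduced here, below is a minimal Python sketch of Pearson's formula applied to the data above:

```python
from math import sqrt

x = [23, 27, 28, 28, 29, 30, 31, 33, 35, 36]  # husbands' ages
y = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]  # wives' ages

n = len(x)
mx, my = sum(x) / n, sum(y) / n               # 30 and 25
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # ~0.82: a strong positive correlation
```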
2. Covariance:
 In probability theory and statistics, covariance is a measure of the joint variability of
two random variables.
 If the greater values of one variable mainly correspond with the greater values of the
other variable, and the same holds for the lesser values, (i.e., the variables tend to show
similar behavior), the covariance is positive.
 In the opposite case, when the greater values of one variable mainly correspond to the
lesser values of the other, (i.e., the variables tend to show opposite behavior), the
covariance is negative.
 The sign of the covariance therefore shows the tendency in the linear
relationship between the variables.
2.1. The Covariance Formula:
The sample covariance formula is:

Cov(X, Y) = Σ (X − μ)(Y − ν) / (n − 1)

where:
X and Y are random variables
E(X) = μ is the expected value (the mean) of the random variable X and
E(Y) = ν is the expected value (the mean) of the random variable Y
n = the number of items in the data set
Example: Calculate covariance for the following data set:
X: 2.1, 2.5, 3.6, 4.0 (mean = 3.1)
Y: 8, 10, 12, 14 (mean = 11)
Substitute the values into the formula and solve:
Cov(X,Y) = Σ (X − μ)(Y − ν) / (n − 1)
= [(2.1 − 3.1)(8 − 11) + (2.5 − 3.1)(10 − 11) + (3.6 − 3.1)(12 − 11) + (4.0 − 3.1)(14 − 11)] / (4 − 1)
= [(−1)(−3) + (−0.6)(−1) + (0.5)(1) + (0.9)(3)] / 3
= (3 + 0.6 + 0.5 + 2.7) / 3
= 6.8 / 3
= 2.267
The result is positive, meaning that the variables are positively related.
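NumPy reproduces this result directly; a minimal sketch:

```python
import numpy as np

x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8, 10, 12, 14])

# np.cov divides by (n - 1) by default; the [0, 1] entry is Cov(X, Y).
print(round(np.cov(x, y)[0, 1], 3))  # 2.267
```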
2.2. Covariance in Excel: Overview
Covariance gives you a positive number if the variables are positively related. You’ll get a
negative number if they are negatively related. A high covariance basically indicates there is a
strong relationship between the variables. A low value means there is a weak relationship.
Covariance in Excel: (Steps)
Step 1: Enter your data into two columns in Excel. For example, type your X values into column
A and your Y values into column B.
Step 2: Click the “Data” tab and then click “Data analysis.” The Data Analysis window will
open.
Step 3: Choose “Covariance” and then click “OK.”
Step 4: Click “Input Range” and then select all of your data. Include column headers if you have
them.
Step 5: Click the “Labels in First Row” check box if you have included column headers in your
data selection.
Step 6: Select “Output Range” and then select an area on the worksheet. A good place to select is an area just to the right of your data set.
Step 7: Click “OK.” The covariance will appear in the area you selected in Step 6.
3. Scatter Diagram: A scatter diagram is a graph that shows the relationship between two
variables. Scatter diagrams can demonstrate a relationship between any element of a process,
environment, or activity on one axis and a quality defect on the other axis.
3.1. Type of Scatter Diagram
According to the type of correlation, scatter diagrams can be divided into following categories:
 Scatter Diagram with No Correlation
 Scatter Diagram with Moderate Correlation
 Scatter Diagram with Strong Correlation
3.1.1. Scatter Diagram with No Correlation
This type of diagram is also known as “Scatter Diagram with Zero Degree of Correlation”.
In this type of scatter diagram, data points are spread so randomly that you cannot draw any line
through them.
In this case you can say that there is no relation between these two variables.
3.1.2. Scatter Diagram with Moderate Correlation
This type of diagram is also known as “Scatter Diagram with Low Degree of Correlation”.
Here, the data points are a little closer together and you can feel that some kind of relation exists between these two variables.
3.1.3. Scatter Diagram with Strong Correlation
This type of diagram is also known as “Scatter Diagram with High Degree of Correlation”.
In this diagram, data points are grouped very close to each other such that you can draw a line by
following their pattern.
In this case you will say that the variables are closely related to each other.
As discussed earlier, we can also divide the scatter diagram according to the slope, or trend, of
the data points:
 Scatter Diagram with Strong Positive Correlation
 Scatter Diagram with Weak Positive Correlation
 Scatter Diagram with Strong Negative Correlation
 Scatter Diagram with Weak Negative Correlation
 Scatter Diagram with Weakest (or no) Correlation
Strong positive correlation means there is a clearly visible upward trend from left to right; a strong negative correlation means there is a clearly visible downward trend from left to right. A weak correlation means the trend, up or down, is less clear. A flat line from left to right is the weakest correlation, as it is neither positive nor negative and indicates that the independent variable does not affect the dependent variable.
3.1.4. Scatter Diagram with Strong Positive Correlation
This type of diagram is also known as Scatter Diagram with Positive Slant.
In a positive slant, the correlation is positive, i.e. as the value of x increases, the value of y will also increase. The slope of a straight line drawn along the data points will go up. The pattern will resemble a straight line.
For example, if the temperature goes up, cold drink sales will also go up.
3.1.5. Scatter Diagram with Weak Positive Correlation
Here as the value of x increases the value of y will also tend to increase, but the pattern will not
closely resemble a straight line.
3.1.6. Scatter Diagram with Strong Negative Correlation
This type of diagram is also known as Scatter Diagram with Negative Slant.
In a negative slant, the correlation is negative, i.e. as the value of x increases, the value of y will decrease. The slope of a straight line drawn along the data points will go down.
For example, if the temperature goes up, sales of winter coats go down.
3.1.7. Scatter Diagram with Weak Negative Correlation
Here as the value of x increases the value of y will tend to decrease, but the pattern will not be as
well defined.
4. Dot Diagram:
 A dot diagram or dot plot is a statistical chart consisting of data points plotted on a
fairly simple scale, typically using filled in circles.
 The dot plot as a representation of a distribution consists of group of data points plotted
on a simple scale.
 Dot plots are used for continuous, quantitative, univariate data.
 Data points may be labelled if there are few of them.
 Dot plots are one of the simplest statistical plots, and are suitable for small to moderate
sized data sets.
 They are useful for highlighting clusters and gaps, as well as outliers.
 Their other advantage is the conservation of numerical information.
5. General Concept of Regression:
 Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.
 Estimation of regression is called regression analysis.
 In regression analysis two variables are involved. One variable is called dependent
variable and the other is called independent variable.
E.g. the yield of rice and rainfall are related. Yield of rice is a dependent variable and
rainfall is an independent variable.
5.1. Definitions:
 “Regression analysis attempts to establish the nature of the relationship between
variables, that is, to study the functional relationship between the variables and thereby
provide a mechanism for predicting or forecasting.” – Ya-Lun-Chow.
 “Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data” – Blair.
5.2. Regression Lines:
 The graphic representation of regression is called regression line.
 One variable is represented as X and the other one as Y.
 In a simple linear regression, there are two regression lines constructed for the
relationship between two variables, say X and Y.
 One regression line shows regression of X upon Y and the other shows the regression of
Y upon X.
 When there is perfectly positive correlation (+1) or perfectly negative correlation (-1) the
two regression lines will coincide with each other i.e., there will be only one line.
 If the regression lines are nearer to each other, then there is a higher degree of correlation.
 If the two lines are farther away from each other, then there is lesser degree of
correlation.
 If r = 0, both variables are independent and there is no correlation, so the two regression lines will cut each other at right angles.
5.3. Regression Coefficient & its Properties:
5.3.1. level-level model
The basic form of linear regression (without the residuals) is:

y = a + b·x

In the formula, y denotes the dependent variable and x is the independent variable. For simplicity, let's assume that it is a univariate regression, but the principles obviously hold for the multivariate case as well.
To put it into perspective, let's say that after fitting the model we receive:

y = 3 + 5·x
Intercept (a)
 x is continuous and centered (by subtracting the mean of x from each observation, the
average of transformed x becomes 0) — average y is 3 when x is equal to the sample mean
 x is continuous, but not centered — average y is 3 when x = 0
 x is categorical — average y is 3 when x = 0 (this time indicating a category, more on this
below)
Coefficient (b)
 x is a continuous variable
Interpretation: a unit increase in x results in an increase in average y by 5 units, all other variables
held constant.
 x is a categorical variable
This requires a bit more explanation. Let’s say that x describes gender and can take values
(‘male’, ‘female’). Now let’s convert it into a dummy variable which takes values 0 for males and
1 for females.
Interpretation: average y is higher by 5 units for females than for males, all other variables held
constant.
5.3.2. log-level model
The model takes the form:

log(y) = a + b·x

where log denotes the natural logarithm. Typically we use the log transformation to pull outlying data from a positively skewed distribution closer to the bulk of the data, in order to make the variable normally distributed. In the case of linear regression, one additional benefit of using the log transformation is interpretability.
Fig: Example of a log transformation (right: before, left: after)
As before, let's say that after fitting the model we receive:

log(y) = 3 + 0.01·x

Intercept (a)
The interpretation is similar to the vanilla (level-level) case; however, we need to take the exponent of the intercept for interpretation: exp(3) = 20.09. The difference is that this value stands for the geometric mean of y (as opposed to the arithmetic mean in the case of the level-level model).
Coefficient (b)
The principles are again similar to the level-level model when it comes to interpreting
categorical/numeric variables. Analogically to the intercept, we need to take the exponent of the
coefficient: exp(b) = exp(0.01) = 1.01. This means that a unit increase in x causes a 1% increase
in average (geometric) y, all other variables held constant.
Two things worth mentioning here:
 There is a rule of thumb when it comes to interpreting coefficients of such a model. If abs(b)
< 0.15 it is quite safe to say that when b = 0.1 we will observe a 10% increase in y for a unit
change in x. For coefficients with larger absolute value, it is recommended to calculate the
exponent.
 When dealing with variables in [0, 1] range (like a percentage) it is more convenient for
interpretation to first multiply the variable by 100 and then fit the model. This way the
interpretation is more intuitive, as we increase the variable by 1 percentage point instead of
100 percentage points (from 0 to 1 immediately).
5.3.3. level-log model
The model takes the form y = a + b·log(x). Let's assume that after fitting the model we receive:

y = a + 5·log(x)

The interpretation of the intercept is the same as in the case of the level-level model.
For the coefficient b: a 1% increase in x results in an approximate increase in average y of b/100 (0.05 in this case), all other variables held constant. To get the exact amount, we would need to take b × log(1.01), which in this case gives 0.0498.
5.3.4. log-log model
The model takes the form log(y) = a + b·log(x). Let's assume that after fitting the model we receive:

log(y) = a + 5·log(x)

Once again, focus on the interpretation of b. An increase in x by 1% results in a 5% increase in average (geometric) y, all other variables held constant. To obtain the exact amount, we need to take 1.01^b = 1.01^5 ≈ 1.051, i.e. about a 5.1% increase.
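The level-level interpretation can be checked numerically; a minimal sketch where the data, noise and true model y = 3 + 5x are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 + 5 * x + rng.normal(0, 1, 200)  # hypothetical data from y = 3 + 5x plus noise

b, a = np.polyfit(x, y, deg=1)         # polyfit returns [slope, intercept] for deg=1
print(round(a, 2), round(b, 2))        # recovered intercept ~3 and slope ~5
```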
6. Standard Error:
 Standard error measures how much the mean of a sample differs from the mean of the population.
 Standard error is defined as the standard deviation of the sample divided by the square root of the total number of observations:
Standard Error = SD / √N
Where,
SD = Standard Deviation
N = Total Number of Observations
Standard error is abbreviated as SE. It is given in the same units as the data.
6.1. Uses of Standard Error:
 It helps to understand the difference between two samples.
 It helps to calculate the size of the sample.
 To determine whether the sample is drawn from a known population or not.
Unit V: Statistical Hypothesis Testing
1. Making Assumption:
 Statistical hypothesis testing requires several assumptions.
 These assumptions include
 Considerations of the level of measurement of the variable.
 The method of sampling, the shape of the population distribution.
 The sample size.
 The specific assumptions may vary, depending on the test or the conditions of testing.
However, without exception, all statistical tests assume random sampling.
 For example, based on our data, we can test the hypothesis that the average price of gas in
California is higher than the average national price of gas. The test we are considering
meets these conditions:
 The sample of California gas stations was randomly selected.
 The variable price per gallon is measured at the interval-ratio level.
 We cannot assume that the population is normally distributed.
2. Statistical Hypotheses:
 A statistical hypothesis is an assumption about a population parameter.
 This assumption may or may not be true. Hypothesis testing refers to the formal
procedures used by statisticians to accept or reject statistical hypotheses.
 The best way to determine whether a statistical hypothesis is true would be to examine
the entire population.
 Since that is often impractical, researchers typically examine a random sample from the
population.
 If sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses:
2.1. Null hypothesis: The null hypothesis, denoted by Ho, is usually the hypothesis that sample
observations result purely from chance.
2.2. Alternative hypothesis: The alternative hypothesis, denoted by H1 or Ha, is the hypothesis
that sample observations are influenced by some non-random cause.
For example, suppose we wanted to determine whether a coin was fair and balanced. A null
hypothesis might be that half the flips would result in Heads and half, in Tails. The alternative
hypothesis might be that the number of Heads and Tails would be very different. Symbolically,
these hypotheses would be expressed as:
Ho: P = 0.5
Ha: P ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given this result, we
would be inclined to reject the null hypothesis. We would conclude, based on the evidence, that
the coin was probably not fair and balanced.
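One way to formalize this coin example, assuming SciPy is available (scipy.stats.binomtest performs an exact binomial test):

```python
from scipy import stats

# Two-sided exact binomial test: 40 heads in 50 flips against Ho: P = 0.5
result = stats.binomtest(40, n=50, p=0.5, alternative="two-sided")
print(result.pvalue)  # far below 0.05, so we reject Ho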
3. Hypothesis Tests
Statisticians follow a formal process to determine whether to reject a null hypothesis, based on
sample data. This process, called hypothesis testing, consists of four steps.
 State the hypotheses. This involves stating the null and alternative hypotheses. The
hypotheses are stated in such a way that they are mutually exclusive. That is, if one is
true, the other must be false.
 Formulate an analysis plan. The analysis plan describes how to use sample data to
evaluate the null hypothesis. The evaluation often focuses around a single test statistic.
 Analyze sample data. Find the value of the test statistic (mean score, proportion, t
statistic, z-score, etc.) described in the analysis plan.
 Interpret results. Apply the decision rule described in the analysis plan. If the value of the
test statistic is unlikely, based on the null hypothesis, reject the null hypothesis.
4. Errors in Hypothesis Testing
Two types of errors can result from a hypothesis test:
 Type I error. A Type I error occurs when the researcher rejects a null hypothesis when it
is true. The probability of committing a Type I error is called the significance level. This
probability is also called alpha, and is often denoted by α.
 Type II error. A Type II error occurs when the researcher fails to reject a null hypothesis
that is false. The probability of committing a Type II error is called beta, and is often
denoted by β. The probability of not committing a Type II error (1 − β) is called the power
of the test.
5. Decision Making Rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice,
statisticians describe these decision rules in two ways - with reference to a P-value or with
reference to a region of acceptance.
 P-value: The strength of the evidence against the null hypothesis is measured by the P-
value. Suppose the test statistic is equal to S. The P-value is the probability of observing a
test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less
than the significance level, we reject the null hypothesis.
 Region of acceptance: The region of acceptance is a range of values. If the test statistic
falls within the region of acceptance, the null hypothesis is not rejected. The region of
acceptance is defined so that the chance of making a Type I error is equal to the
significance level.
The set of values outside the region of acceptance is called the region of rejection. If the
test statistic falls within the region of rejection, the null hypothesis is rejected. In such
cases, we say that the hypothesis has been rejected at the α level of significance.
These approaches are equivalent. Some statistics texts use the P-value approach; others use the
region of acceptance approach.
6. One-Tailed and Two-Tailed Tests
A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling
distribution, is called a one-tailed test. For example, suppose the null hypothesis states that the
mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater
than 10. The region of rejection would consist of a range of numbers located on the right side of
sampling distribution; that is, a set of numbers greater than 10.
A test of a statistical hypothesis, where the region of rejection is on both sides of the sampling
distribution, is called a two-tailed test. For example, suppose the null hypothesis states that the
mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater
than 10. The region of rejection would consist of a range of numbers located on both sides of
sampling distribution; that is, the region of rejection would consist partly of numbers that were
less than 10 and partly of numbers that were greater than 10.
7. Confidence Interval:
 A confidence interval expresses how much uncertainty there is in any particular statistic.
Confidence intervals are often used with a margin of error.
 It states how confident one can be that the results from a poll or survey reflect what would
be found if it were possible to survey the entire population.
 Confidence intervals are intrinsically connected to confidence levels.
 Confidence intervals consist of a range of potential values of the unknown population
parameter.
 However, the interval computed from a particular sample does not necessarily include the
true value of the parameter.
 Based on the (usually taken) assumption that observed data are random samples from a
true population, the confidence interval obtained from the data is also random.
 The confidence level is designated prior to examining the data. Most commonly, the 95%
confidence level is used. However, other confidence levels can be used, for example,
90% and 99%.
 Factors affecting the width of the confidence interval include the size of the sample, the
confidence level, and the variability in the sample.
 A larger sample will tend to produce a better estimate of the population parameter, when
all other factors are equal.
 A higher confidence level will tend to produce a broader confidence interval.
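A minimal sketch of a 95% confidence interval for a mean, using invented data and SciPy's t distribution:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # made-up data

mean = sample.mean()
se = stats.sem(sample)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)
print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Asking for 99% confidence instead of 95% widens the interval, illustrating the trade-off between confidence level and precision noted above.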
Unit VI: Test of Significance
1. Steps in Testing Statistical Significance:
1. The first step is to specify the null hypothesis. For a two-tailed test, the null hypothesis is
typically that a parameter equals zero although there are exceptions. A typical null
hypothesis is μ1 − μ2 = 0, which is equivalent to μ1 = μ2. For a one-tailed test, the null
hypothesis is either that a parameter is greater than or equal to zero or that a parameter is
less than or equal to zero. If the prediction is that μ1 is larger than μ2, then the null
hypothesis (the reverse of the prediction) is μ2 - μ1 ≥ 0. This is equivalent to μ1 ≤ μ2.
2. The second step is to specify the α level which is also known as the significance level.
Typical values are 0.05 and 0.01.
3. The third step is to compute the probability value (also known as the p value). This is the
probability of obtaining a sample statistic as different or more different from the
parameter specified in the null hypothesis given that the null hypothesis is true.
4. Finally, compare the probability value with the α level. If the probability value is lower,
you reject the null hypothesis. Keep in mind that rejecting the null hypothesis is not
an all-or-none decision: the lower the probability value, the more confidence you can
have that the null hypothesis is false. However, if your probability value is higher than
the conventional α level of 0.05, most scientists will consider your findings inconclusive.
Failure to reject the null hypothesis does not constitute support for the null hypothesis;
it just means you do not have sufficiently strong data to reject it.
2. Sampling Distribution of Mean and Standard Error:
The sampling distribution of a statistic is the distribution of that statistic, considered as
a random variable, when derived from a random sample of size n. It may be considered as the
distribution of the statistic for all possible samples from the same population of a given sample
size. The sampling distribution depends on the underlying distribution of the population, the
statistic being considered, the sampling procedure employed, and the sample size used. There is
often considerable interest in whether the sampling distribution can be approximated by
an asymptotic distribution, which corresponds to the limiting case either as the number of
random samples of finite size, taken from an infinite population and used to produce the
distribution, tends to infinity, or when just one equally-infinite-size "sample" is taken of that
same population.
2.1. Standard Error:
The standard error (SE) is very similar to the standard deviation. Both are measures of spread:
the higher the number, the more spread out your data are. To put it simply, the two terms are
essentially equal, but there is one important difference. While the standard error
uses statistics (sample data), standard deviations use parameters (population data).
In statistics, you’ll come across terms like “the standard error of the mean” or “the standard error
of the median.” The SE tells you how far your sample statistic (like the sample mean) deviates
from the actual population mean. The larger your sample size, the smaller the SE. In other words,
the larger your sample size, the closer your sample mean is to the actual population mean.
2.2. SE Calculation:
How you find the standard error depends on which statistic you need; for example, the
calculation is different for a mean than for a proportion. When you are asked to find the
"sample error", you are probably being asked for the standard error of the mean, which uses
the formula s/√n. The formulas for other statistics are given below.
2.3. Standard Error Formula:
The following tables show how to find the standard deviation (first table) and the SE (second
table). The first table assumes you know the relevant population parameters. If you do not
know the population parameters, you can find the standard error for:
 Sample mean.
 Sample proportion.
 Difference between means.
 Difference between proportions.
Parameter (Population)            Formula for Standard Deviation
Sample mean                       σ / √n
Sample proportion, P              √[P(1 − P) / n]
Difference between means          √[σ₁²/n₁ + σ₂²/n₂]
Difference between proportions    √[P₁(1 − P₁)/n₁ + P₂(1 − P₂)/n₂]

Statistic (Sample)                Formula for Standard Error
Sample mean                       s / √n
Sample proportion, p              √[p(1 − p) / n]
Difference between means          √[s₁²/n₁ + s₂²/n₂]
Difference between proportions    √[p₁(1 − p₁)/n₁ + p₂(1 − p₂)/n₂]

Key for the above tables:
P = Proportion of successes (population). p = Proportion of successes (sample).
n = Number of observations (sample). n₁ = Number of observations (sample 1).
n₂ = Number of observations (sample 2).
σ₁², σ₂² = Population variances of groups 1 and 2. s₁², s₂² = Sample variances of groups 1 and 2.
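The sample-side formulas from the second table translate directly into code; the numeric inputs below are invented:

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference between two sample means."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference between two sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(se_mean(2.5, 100))                        # 0.25
print(se_diff_proportions(0.4, 200, 0.5, 180))  # about 0.051
```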
2.4. Sampling Distribution of the Mean:
Definition: The Sampling Distribution of the Mean is the probability distribution of the sample
mean over all possible samples of a given size drawn from the population. If the population
distribution is normal, then the sampling distribution of the mean is normal for samples of all
sizes.
Following are the main properties of the sampling distribution of the mean:
 Its mean is equal to the population mean; thus μx̄ = μ
(x̄ = sample mean, μ = population mean).
 Its standard deviation is equal to the population standard deviation divided by the square
root of the sample size; thus σx̄ = σ / √n
(σ = population standard deviation, n = sample size).
 The sampling distribution of the mean is normally distributed. That is, the distribution of
sample means for a large sample size is normally distributed irrespective of the shape of the
parent population, provided the population standard deviation (σ) is finite. Generally, a sample
size of 30 or more is considered large for statistical purposes. If the population is normal,
then the distribution of sample means will be normal, irrespective of the sample size.
σx̄ is a measure of the precision with which the sample mean x̄ can be used to estimate the true
value of the population mean μ. σx̄ varies in direct proportion to the variation in the original
population and inversely with the square root of the sample size n. Thus, the greater the
variation among the original items of the population, the greater the variation expected in the
sampling error when using x̄ as an estimate of μ. It is to be noted that the larger the sample size,
the smaller the standard error, and vice versa.
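A small simulation that illustrates both properties: the mean of the sample means tracks the population mean, and their spread tracks σ/√n, even for a deliberately non-normal population (all numbers here are simulated, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # deliberately non-normal

# Draw 5,000 samples of size n = 30 and record each sample mean.
n = 30
means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

print(means.mean(), population.mean())             # close to the population mean
print(means.std(), population.std() / np.sqrt(n))  # close to sigma / sqrt(n)
```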
3. Large Sample Tests:
 Some researchers choose to increase their sample size if they have an effect which is
almost within the significance level.
 This is done because the researcher suspects that he is short of samples, rather than that
there is no effect. We need to be careful using this method, as it increases the
chance of creating a false positive result.
 When we have a higher sample size, the likelihood of encountering Type-I and Type-II
errors is reduced, at least if the other parts of the study are carefully constructed and
problems avoided.
 A higher sample size allows the researcher to increase the significance level of the findings,
since the confidence in the result is likely to increase with a higher sample size.
 This is to be expected, because the larger the sample size, the more accurately it is expected
to mirror the behavior of the whole group.
 Therefore, if you want to reject your null hypothesis, you should make sure your
sample size is at least equal to the sample size needed for the chosen statistical
significance and expected effect size.
4. Z- Test:
 A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution.
 Because of the central limit theorem, many test statistics are approximately normally
distributed for large samples.
 For each significance level, the Z-test has a single critical value (for example, 1.96 for
5% two-tailed), which makes it more convenient than Student's t-test, which has
separate critical values for each sample size.
 Therefore, many statistical tests can be conveniently performed as approximate Z-tests if
the sample size is large or the population variance is known.
 If the population variance is unknown (and therefore has to be estimated from the sample
itself) and the sample size is not large (n < 30), the Student's t-test may be more
appropriate.
 If T is a statistic that is approximately normally distributed under the null hypothesis, the
next step in performing a Z-test is to estimate the expected value θ of T under the null
hypothesis, and then obtain an estimate s of the standard deviation of T.
 After that the standard score Z = (T − θ) / s is calculated, from which one-tailed and two-
tailed p-values can be calculated as Φ(−Z) (for upper-tailed tests), Φ(Z) (for lower-tailed
tests) and 2Φ(−|Z|) (for two-tailed tests) where Φ is the standard normal cumulative
distribution function.
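A minimal sketch of these steps for a hypothetical one-sample Z-test (all numbers invented; Φ is evaluated with scipy.stats.norm):

```python
import math
from scipy.stats import norm

# Hypothetical one-sample Z-test: Ho mu = 100 with known sigma = 15, n = 50.
sample_mean, mu0, sigma, n = 104.0, 100.0, 15.0, 50

z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # standard score Z = (T - theta) / s
p_two_tailed = 2 * norm.cdf(-abs(z))              # 2 * Phi(-|Z|)
print(round(z, 3), round(p_two_tailed, 4))
```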
5. T- Test:
 When the difference between two population averages is being investigated, a t test is
used.
 In other words, a t test is used when we wish to compare two means (the scores must be
measured on an interval or ratio measurement scale). For example, we would use a t test
if we wished to compare the reading achievement of boys and girls.
 With a t test, we have one independent variable and one dependent variable. The
independent variable (gender in this case) can only have two levels (male and female).
The dependent variable would be reading achievement. If the independent variable had
more than two levels, then we would use a one-way analysis of variance (ANOVA).
 The test statistic that a t test produces is a t-value. Conceptually, t-values are an extension
of z-scores. In a way, the t-value represents how many standard units the means of the
two groups are apart.
 With a t test, the researcher wants to state with some degree of confidence that the
obtained difference between the means of the sample groups is too great to be a chance
event and that some difference also exists in the population from which the sample was
drawn.
 In other words, the difference that we might find between the boys’ and girls’ reading
achievement in our sample might have occurred by chance, or it might exist in the
population.
 If our t test produces a t-value that results in a probability of .01, we say that the
likelihood of getting the difference we found by chance would be 1 in 100.
 We could say that it is unlikely that our results occurred by chance and the difference we
found in the sample probably exists in the populations from which it was drawn.
5.1. Paired and Unpaired T- test:
 T-tests are useful for comparing the means of two samples. There are two types: paired
and unpaired.
 Paired means that both samples consist of the same test subjects. A paired t-test is
equivalent to a one-sample t-test on the differences between the paired measurements.
 Unpaired means that the two samples consist of distinct test subjects. An unpaired t-test is
equivalent to a two-sample t-test.
 For example, if you wanted to conduct an experiment to see how drinking an energy
drink increases heart rate, you could do it two ways.
 The "paired" way would be to measure the heart rate of 10 people before they drink the
energy drink and then measure the heart rate of the same 10 people after drinking the
energy drink. These two samples consist of the same test subjects, so you would perform
a paired t-test on the means of both samples.
 The "unpaired" way would be to measure the heart rate of 10 people before drinking an
energy drink and then measure the heart rate of a different group of people after they
have drunk the energy drink. These two samples consist of different test subjects, so you
would perform an unpaired t-test on the means of both samples. A minimal sketch of both
versions follows below.
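The sketch uses SciPy and made-up heart-rate data (ttest_rel for the paired design, ttest_ind for the unpaired one):

```python
from scipy import stats

before = [72, 75, 68, 80, 74, 70, 77, 73, 69, 76]       # made-up heart rates (bpm)
after_drink = [78, 80, 74, 85, 79, 76, 83, 77, 75, 82]  # same ten subjects, after

paired = stats.ttest_rel(before, after_drink)    # same subjects measured twice
unpaired = stats.ttest_ind(before, after_drink)  # as if the groups were independent
print(paired.pvalue, unpaired.pvalue)
```

Because each subject serves as his or her own control, the paired test usually detects a consistent change with a much smaller p-value than the unpaired test on the same numbers.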
6. Parametric and Non parametric tests:
6.1. Definition of Parametric Test
The parametric test is a hypothesis test which provides generalizations for making statements
about the mean of the parent population. A t-test, based on Student's t-statistic, is often
used in this regard.
The t-statistic rests on the underlying assumptions that the variable is normally distributed
and the mean is known or assumed to be known. The population variance is estimated from the
sample. It is assumed that the variables of interest in the population are measured on an interval
scale.
6.2. Definition of Nonparametric Test
The nonparametric test is defined as a hypothesis test which is not based on underlying
assumptions, i.e. it does not require the population's distribution to be characterized by specific
parameters.
The test is mainly based on differences in medians. Hence, it is alternately known as the
distribution-free test. The test assumes that the variables are measured on a nominal or ordinal
level. It is used when the independent variables are non-metric.
In statistics, the Mann–Whitney U test is a nonparametric test of the null hypothesis that it is
equally likely that a randomly selected value from one sample will be less than or greater than a
randomly selected value from a second sample.
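A minimal sketch of the Mann-Whitney U test with SciPy, on invented ordinal scores:

```python
from scipy import stats

group_a = [3, 5, 4, 6, 8, 7, 5]  # made-up ordinal satisfaction scores
group_b = [6, 9, 7, 8, 10, 9, 8]

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, p_value)
```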
Key Differences between Parametric and Non-parametric Tests
The fundamental differences between parametric and nonparametric test are discussed in the
following points:
1. A statistical test in which specific assumptions are made about the population parameters
is known as a parametric test. A statistical test used in the case of non-metric
independent variables is called a nonparametric test.
2. In the parametric test, the test statistic is based on a known distribution. On the other hand,
the test statistic is arbitrary in the case of the nonparametric test.
3. In the parametric test, it is assumed that the variables of interest are measured on an
interval or ratio scale, as opposed to the nonparametric test, wherein the variables of
interest are measured on a nominal or ordinal scale.
4. In general, the measure of central tendency in the parametric test is the mean, while in the
case of the nonparametric test it is the median.
5. In the parametric test, there is complete information about the population. Conversely, in
the nonparametric test, there is no information about the population.
6. The applicability of the parametric test is for variables only, whereas the nonparametric
test applies to both variables and attributes.
7. For measuring the degree of association between two quantitative variables, Pearson's
coefficient of correlation is used in the parametric test, while Spearman's rank correlation
is used in the nonparametric test.
7. Chi Square Test:
 A chi-squared test, also written as a χ² test, is any statistical hypothesis test where
the sampling distribution of the test statistic is a chi-squared distribution when the null
hypothesis is true. Without other qualification, 'chi-squared test' is often used as
shorthand for Pearson's chi-squared test.
 The chi-squared test is used to determine whether there is a significant difference
between the expected frequencies and the observed frequencies in one or more
categories.
 In the standard applications of this test, the observations are classified into mutually
exclusive classes, and there is some theory, or say null hypothesis, which gives the
probability that any observation falls into the corresponding class.
 The purpose of the test is to evaluate how likely the observations that are made would be,
assuming the null hypothesis is true.
 Chi-squared tests are often constructed from a sum of squared errors, or through
the sample variance.
 Test statistics that follow a chi-squared distribution arise from an assumption of
independent normally distributed data, which is valid in many cases due to the central
limit theorem.
 A chi-squared test can be used to attempt rejection of the null hypothesis that the data are
independent.
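A minimal sketch of Pearson's chi-squared test of independence on a hypothetical 2×2 table of counts, using scipy.stats.chi2_contingency:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed counts (treatment vs. outcome)
observed = np.array([[30, 20],
                     [15, 35]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p)  # a small p-value argues against independence of rows and columns
```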
Unit VII: Experimental Designs
1. Principles of Experimental Design:
The basic principles of experimental design are (i) Randomization, (ii) Replication and
(iii) Local Control.
1.1. Randomization:
Randomization is the cornerstone underlying the use of statistical methods in experimental
designs. Randomization is the random process of assigning treatments to the experimental units.
The random process implies that every possible allotment of treatments has the same probability.
For example, if the number of treatments t = 3 (say, A, B, and C) and the number of
replications r = 4, then the number of experimental units is n = t × r = 3 × 4 = 12. Replication
means that each treatment will appear 4 times, as r = 4. Let the design be
A C B C
C B A B
A C B A
Note from the design (reading row by row) that elements 1, 7, 9 and 12 are reserved for
Treatment A, elements 3, 6, 8 and 11 are reserved for Treatment B, and elements 2, 4, 5 and 10
are reserved for Treatment C. P(A) = 4/12, P(B) = 4/12, and P(C) = 4/12, meaning that
Treatments A, B, and C have equal chances of selection.
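The same kind of allocation can be generated mechanically; the sketch below shuffles the 12 treatment labels so that every allotment is equally likely (the seed is only there to make the illustration reproducible):

```python
import random

# t = 3 treatments, r = 4 replications, n = 12 experimental units
treatments = ["A", "B", "C"] * 4
random.seed(42)             # seeded only for a reproducible illustration
random.shuffle(treatments)  # every allotment of treatments is equally likely
for unit, treatment in enumerate(treatments, start=1):
    print(f"unit {unit:2d} -> treatment {treatment}")
```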
1.2. Replication:
The second principle of an experimental design is replication, which is a repetition of the
basic experiment. In other words, it is a complete run for all the treatments to be tested in the
experiment. In all experiments, some kind of variation is introduced because of the fact that
the experimental units such as individuals or plots of land in agricultural experiments cannot
be physically identical. This type of variation can be removed by using a number of
experimental units. We therefore perform the experiment more than once, i.e., we repeat the
basic experiment. An individual repetition is called a replicate. The number, the shape and
the size of replicates depend upon the nature of the experimental material. A replication is
used to:
(i) Secure a more accurate estimate of the experimental error, a term which represents the
differences that would be observed if the same treatments were applied several times to the
same experimental units;
(ii) Decrease the experimental error and thereby increase precision, which is a measure of the
variability of the experimental error.
1.3. Local Control:
It has been observed that not all extraneous sources of variation are removed by randomization
and replication; i.e., these alone are unable to control the extraneous sources of variation.
Thus we need a refinement in the experimental technique. In other words, we need to choose a
design in such a way that all extraneous sources of variation are brought under control. For this
purpose we make use of local control, a term referring to the amount of (i) balancing, (ii)
blocking and (iii) grouping of experimental units.
Balancing: Balancing means that the treatments should be assigned to the experimental units in
such a way that the result is a balanced arrangement of treatments.
Blocking: Blocking means that like experimental units should be collected together to form
relatively homogeneous groups. A block is also a replicate.
The main purpose of local control is to increase the efficiency of the experimental design
by decreasing the experimental error.
2. Longitudinal Study:
 A longitudinal study (or longitudinal survey, or panel study) is a research design that
involves repeated observations of the same variables (e.g., people) over short or long
periods of time (i.e., uses longitudinal data).
 It is often a type of observational study, although longitudinal studies can also be
structured as randomized experiments.
 Longitudinal studies are often used in social-personality and clinical psychology to study
rapid fluctuations in behaviors, thoughts, and emotions from moment to moment or day
to day, and in developmental psychology to study developmental trends across the life span.
 Longitudinal studies can be retrospective (looking back in time, thus using existing data
such as medical records or claims database) or prospective (requiring the collection of
new data).
3. Cross Sectional Study:
 In medical research and social science, a cross-sectional study (also known as a cross-
sectional analysis, transverse study, prevalence study) is a type of observational study
that analyzes data from a population, or a representative subset, at a specific point in
time—that is, cross-sectional data.
 In medical research, cross-sectional studies differ from case-control studies in that they
aim to provide data on the entire population under study, whereas case-control studies
typically include only individuals with a specific characteristic, with a sample, often a
tiny minority, of the rest of the population.
 Cross-sectional studies are descriptive studies (neither longitudinal nor experimental).
 The study may be used to describe some feature of the population, such as prevalence of
an illness, or they may support inferences of cause and effect.
 Longitudinal studies differ from both in making a series of observations more than once
on members of the study population over a period of time.
4. Prospective and Retrospective Study:
4.1. Prospective study
 It is an epidemiologic study in which groups of individuals (cohorts) are selected
on the basis of factors that are to be examined for possible effects on some outcome.
 For example, the effect of exposure to a specific risk factor on the eventual development
of a particular disease can be studied.
 The cohorts are then followed over a period of time to determine the incidence rates of the
outcomes being studied as they relate to the original factors. It is also called a cohort study.
The term prospective usually implies a cohort selected in the present and followed into
the future, but this method can also be applied to existing longitudinal historical data,
such as insurance or medical records.
 A cohort may also be identified and classified as to exposure to the risk factor at some date
in the past and followed up to the present to determine incidence rates. This is called a
historical prospective study, a prospective study of past data, or a retrospective cohort study.
4.2. Retrospective study:
 It is an epidemiologic study in which participating individuals are classified
as either having some outcome (cases) or lacking it (controls).
 The outcome may be a specific disease, and the persons' histories are examined for
specific factors that might be associated with that outcome.
 Cases and controls are often matched with respect to certain demographic or other
variables, but need not be.
 As compared to prospective studies, retrospective studies suffer from drawbacks: certain
important statistics cannot be measured, and large biases may be introduced both in the
selection of controls and in the recall of past exposure to risk factors.
 The advantages of the retrospective study are its small scale, its usually short time to
completion, and its applicability to rare diseases, which would require the study of very
large cohorts in prospective studies.
5. Randomized Block:
 The blocks method was introduced by S. Bernstein.
 In the statistical theory of the design of experiments, blocking is the arranging
of experimental units in groups (blocks) that are similar to one another.
 Typically, a blocking factor is a source of variability that is not of primary interest to the
experimenter.
 In Probability Theory the blocks method consists of splitting a sample into blocks
(groups) separated by smaller sub-blocks so that the blocks can be considered almost
independent.
 The blocks method helps proving limit theorems in the case of dependent random
variables.
Example:
                 Treatment
Gender     Placebo     Vaccine
Male       250         250
Female     250         250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are randomly
assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the
placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.
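A minimal sketch of this blocked assignment, with hypothetical subject IDs:

```python
import random

random.seed(1)  # for a reproducible illustration
subjects = {"Male": [f"M{i}" for i in range(500)],
            "Female": [f"F{i}" for i in range(500)]}  # hypothetical subject IDs

assignment = {}
for gender, ids in subjects.items():  # block on gender first...
    random.shuffle(ids)               # ...then randomize within each block
    assignment[gender] = {"Placebo": ids[:250], "Vaccine": ids[250:]}

print({g: {t: len(v) for t, v in a.items()} for g, a in assignment.items()})
# {'Male': {'Placebo': 250, 'Vaccine': 250}, 'Female': {'Placebo': 250, 'Vaccine': 250}}
```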
6. Simple Factorial Design:
 Factorial design is one of the many experimental designs used in psychological
experiments where two or more independent variables are simultaneously manipulated to
observe their effects on the dependent variables.
 A simple factorial design is an experimental design where 2 or more levels of
each variable are observed in combination with 2 or more levels of every other variable.
Example:
A university wants to assess the starting salaries of their MBA graduates. The study looks at
graduates working in four different employment areas: accounting, management, finance, and
marketing. In addition to looking at the employment sector, the researchers also look at gender.
In this example, the employment sector and gender of the graduates are the independent
variables, and the starting salaries are the dependent variables. This would be considered a 4×2
factorial design.
7. Analysis of Variance (ANOVA):
 Analysis of Variance (ANOVA) is a statistical method used to test differences between
two or more means. It may seem odd that the technique is called “Analysis of Variance”
rather than “Analysis of Means.”
 As we can see, the name is appropriate because inferences about means are made by
analyzing variance. ANOVA is used to test general rather than specific differences
among means.
 An ANOVA conducted on a design in which there is only one factor is called a ONE-
WAY ANOVA.
 If an experiment has two factors, then the ANOVA is called a TWO-WAY ANOVA.
Example: Suppose an experiment on the effects of age and gender on reading speed were
conducted using three age groups (8 years, 10 years, and 12 years) and the two genders (male
and female). The factors would be age and gender. Age would have three levels and gender
would have two levels.
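A minimal sketch of a one-way ANOVA on invented reading-speed data for the three age groups, using scipy.stats.f_oneway (a full two-way ANOVA on age and gender would require a different routine):

```python
from scipy import stats

# Made-up reading speeds (words per minute) for three age groups
age_8 = [90, 85, 95, 88, 92]
age_10 = [110, 105, 115, 108, 112]
age_12 = [130, 125, 128, 135, 132]

f_stat, p_value = stats.f_oneway(age_8, age_10, age_12)  # one factor: age (3 levels)
print(f_stat, p_value)
```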
8. Analysis of RBD:
 A Reliability Block Diagram (RBD) is a graphical representation of the components of
the system and how they are reliability-wise related.
 The diagram represents the functioning state (i.e., success or failure) of the system in
terms of the functioning states of its components.
 For example, a simple series configuration indicates that all of the components must
operate for the system to operate; a simple parallel configuration indicates that at least
one of the components must operate, and so on.
 When we define the reliability characteristics of each component, we can use software to
calculate the reliability function for the entire system and obtain a wide variety of system
reliability analysis results, including the ability to identify critical components and
calculate the optimum reliability allocation strategy to meet a system reliability goal.
9. Meta-analysis:
 A meta-analysis is a statistical analysis that combines the results of multiple scientific
studies.
 Meta-analysis can be performed when there are multiple scientific studies addressing the
same question, with each individual study reporting measurements that are expected to
have some degree of error.
 The aim then is to use approaches from statistics to derive a pooled estimate closest to the
unknown common truth based on how this error is perceived.
 Existing methods for meta-analysis yield a weighted average from the results of the
individual studies, and what differs is the manner in which these weights are allocated
and also the manner in which the uncertainty is computed around the point estimate thus
generated.
 In addition to providing an estimate of the unknown common truth, meta-analysis has the
capacity to contrast results from different studies and identify patterns among study
results, sources of disagreement among those results, or other interesting relationships
that may come to light in the context of multiple studies.
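As an illustration of the weighted-average idea, here is a minimal fixed-effect (inverse-variance) pooling sketch on invented study results; real meta-analyses would also assess heterogeneity between studies:

```python
import numpy as np

# Hypothetical effect estimates and standard errors from three studies
effects = np.array([0.30, 0.45, 0.25])
ses = np.array([0.10, 0.15, 0.08])

weights = 1 / ses**2                                  # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)  # weighted average
pooled_se = np.sqrt(1 / np.sum(weights))              # uncertainty of the pooled estimate
print(round(pooled, 3), round(pooled_se, 3))
```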
10. Systematic Review:
 Systematic reviews are a type of literature review that uses systematic methods to collect
secondary data, critically appraise research studies, and synthesize findings qualitatively
or quantitatively.
 Systematic reviews formulate research questions that are broad or narrow in scope, and
identify and synthesize studies that directly relate to the systematic review question.
 They are designed to provide a complete, exhaustive summary of current evidence
relevant to a research question.
 For example, systematic reviews of randomized controlled trials are key to the practice
of evidence-based medicine, and a review of existing studies is often quicker and cheaper
than embarking on a new study.
 While systematic reviews are often applied in the biomedical or healthcare context, they
can be used in other areas where an assessment of a precisely defined subject would be
helpful.
 Systematic reviews may examine clinical tests, public health interventions,
environmental interventions, social interventions, adverse effects, and economic
evaluations.
11. Ethics in Statistics:
Good statistical practice is fundamentally based on transparent assumptions, reproducible results,
and valid interpretations. In some situations, guideline principles may conflict, requiring
individuals to prioritize principles according to context. However, in all cases, stakeholders have
an obligation to act in good faith, to act in a manner that is consistent with these guidelines, and
to encourage others to do the same. Above all, professionalism in statistical practice presumes
the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical
ends is inherently unethical.
Ethical statistical practice does not include, promote, or tolerate any type of professional or
scientific misconduct, including, but not limited to, bullying, sexual or other harassment,
discrimination based on personal characteristics, or other forms of intimidation.
A. Professional Integrity and Accountability: The ethical statistician uses methodology and
data that are relevant and appropriate; without favoritism or prejudice; and in a manner intended
to produce valid, interpretable, and reproducible results. The ethical statistician does not
knowingly accept work for which he/she is not sufficiently qualified, is honest with the client
about any limitation of expertise, and consults other statisticians when necessary or in doubt. It is
essential that statisticians treat others with respect.
The ethical statistician:
1. Identifies and mitigates any preferences on the part of the investigators or data providers that
might predetermine or influence the analyses/results.
2. Employs selection or sampling methods and analytic approaches appropriate and valid for the
specific question to be addressed, so that results extend beyond the sample to a population
relevant to the objectives with minimal error under reasonable assumptions.
3. Respects and acknowledges the contributions and intellectual property of others.
4. When establishing authorship order for posters, papers, and other scholarship, strives to make
clear the basis for this order, if determined on grounds other than intellectual contribution.
5. Discloses conflicts of interest, financial and otherwise, and manages or resolves them
according to established (institutional/regional/local) rules and laws.
6. Accepts full responsibility for his/her professional performance. Provides only expert
testimony, written work, and oral presentations that he/she would be willing to have peer
reviewed.
7. Exhibits respect for others and, thus, neither engages in nor condones discrimination based on
personal characteristics; bullying; unwelcome physical, including sexual, contact; or other forms
of harassment or intimidation, and takes appropriate action when aware of such unethical
practices by others.
B. Integrity of data and methods: The ethical statistician is candid about any known or
suspected limitations, defects, or biases in the data that may affect the integrity or reliability of
the statistical analysis. Objective and valid interpretation of the results requires that the
underlying analysis recognizes and acknowledges the degree of reliability and integrity of the
data.
The ethical statistician:
1. Acknowledges statistical and substantive assumptions made in the execution and interpretation
of any analysis. When reporting on the validity of data used, acknowledges data editing
procedures, including any imputation and missing data mechanisms.
2. Reports the limitations of statistical inference and possible sources of error.
3. In publications, reports, or testimony, identifies who is responsible for the statistical work if it
would not otherwise be apparent.
4. Reports the sources and assessed adequacy of the data, accounts for all data considered in a
study, and explains the sample(s) actually used.
5. Clearly and fully reports the steps taken to preserve data integrity and valid results.
6. Where appropriate, addresses potential confounding variables not included in the study.
7. In publications and reports, conveys the findings in ways that are both honest and meaningful
to the user/reader. This includes tables, models, and graphics.
8. In publications or testimony, identifies the ultimate financial sponsor of the study, the stated
purpose, and the intended use of the study results.
9. When reporting analyses of volunteer data or other data that may not be representative of a
defined population, includes appropriate disclaimers and, if used, appropriate weighting.
10. To aid peer review and replication, shares the data used in the analyses whenever
possible/allowable and exercises due caution to protect proprietary and confidential data,
including all data that might inappropriately reveal respondent identities.
11. Strives to promptly correct any errors discovered while producing the final report or after
publication. As appropriate, disseminates the correction publicly or to others relying on the
results.
C. Responsibilities to Science/Public/Funder/Client: The ethical statistician supports valid
inferences, transparency, and good science in general, keeping the interests of the public, funder,
client, or customer in mind (as well as professional colleagues, patients, the public, and the
scientific community).
The ethical statistician:
1. To the extent possible, presents a client or employer with choices among valid alternative
statistical approaches that may vary in scope, cost, or precision.
2. Strives to explain any expected adverse consequences of failure to follow through on an
agreed-upon sampling or analytic plan.
3. Applies statistical sampling and analysis procedures scientifically, without predetermining the
outcome.
4. Strives to make new statistical knowledge widely available to provide benefits to society at
large and beyond his/her own scope of applications.
5. Understands and conforms to confidentiality requirements of data collection, release, and
dissemination and any restrictions on its use established by the data provider (to the extent
legally required), protecting use and disclosure of data accordingly. Guards privileged
information of the employer, client, or funder.
D. Responsibilities to Research Subjects: The ethical statistician protects and respects the
rights and interests of human and animal subjects at all stages of their involvement in a project.
This includes respondents to censuses or surveys, those whose data are contained in
administrative records, and the subjects of physically or psychologically invasive research.
The ethical statistician:
Biostatistical methods
Biostatistical methods

More Related Content

What's hot

Basic measurements in epidemiology
Basic measurements in epidemiologyBasic measurements in epidemiology
Basic measurements in epidemiology
Rizwan S A
 
Confidence intervals
Confidence intervalsConfidence intervals
Confidence intervals
Tanay Tandon
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatistics
shivamdixit57
 
Lec. biostatistics introduction
Lec. biostatistics  introductionLec. biostatistics  introduction
Lec. biostatistics introduction
Riaz101
 
Measures of association
Measures of associationMeasures of association
Measures of association
IAU Dent
 
Hypothesis
HypothesisHypothesis
Biostatistics ppt
Biostatistics  pptBiostatistics  ppt
Biostatistics ppt
santhoshikayithi
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
Muhammadasif909
 
Biostatistics Concept & Definition
Biostatistics Concept & DefinitionBiostatistics Concept & Definition
Biostatistics Concept & Definition
Southern Range, Berhampur, Odisha
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
Kaori Kubo Germano, PhD
 
Biostatics
BiostaticsBiostatics
Biostatics
Osama Zahid
 
Anova ONE WAY
Anova ONE WAYAnova ONE WAY
Anova ONE WAY
elulu123
 
Epidemiological modelling
Epidemiological modellingEpidemiological modelling
Epidemiological modelling
Sumit Das
 
Introduction of biostatistics
Introduction of biostatisticsIntroduction of biostatistics
Introduction of biostatistics
khushbu
 
Lecture2 hypothesis testing
Lecture2 hypothesis testingLecture2 hypothesis testing
Lecture2 hypothesis testing
o_devinyak
 
Application of Biostatistics
Application of BiostatisticsApplication of Biostatistics
Application of Biostatistics
Jippy Jack
 
Histogram
HistogramHistogram
Study designs in epidemiology
Study designs in epidemiologyStudy designs in epidemiology
Study designs in epidemiology
Bhoj Raj Singh
 
Introduction to Biostatistics
Introduction to BiostatisticsIntroduction to Biostatistics
Introduction to Biostatistics
Abdul Wasay Baloch
 
Sir presentation
Sir presentationSir presentation
Sir presentation
LabartinosAllan
 

What's hot (20)

Basic measurements in epidemiology
Basic measurements in epidemiologyBasic measurements in epidemiology
Basic measurements in epidemiology
 
Confidence intervals
Confidence intervalsConfidence intervals
Confidence intervals
 
Introduction to biostatistics
Introduction to biostatisticsIntroduction to biostatistics
Introduction to biostatistics
 
Lec. biostatistics introduction
Lec. biostatistics  introductionLec. biostatistics  introduction
Lec. biostatistics introduction
 
Measures of association
Measures of associationMeasures of association
Measures of association
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Biostatistics ppt
Biostatistics  pptBiostatistics  ppt
Biostatistics ppt
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Biostatistics Concept & Definition
Biostatistics Concept & DefinitionBiostatistics Concept & Definition
Biostatistics Concept & Definition
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Biostatics
BiostaticsBiostatics
Biostatics
 
Anova ONE WAY
Anova ONE WAYAnova ONE WAY
Anova ONE WAY
 
Epidemiological modelling
Epidemiological modellingEpidemiological modelling
Epidemiological modelling
 
Introduction of biostatistics
Introduction of biostatisticsIntroduction of biostatistics
Introduction of biostatistics
 
Lecture2 hypothesis testing
Lecture2 hypothesis testingLecture2 hypothesis testing
Lecture2 hypothesis testing
 
Application of Biostatistics
Application of BiostatisticsApplication of Biostatistics
Application of Biostatistics
 
Histogram
HistogramHistogram
Histogram
 
Study designs in epidemiology
Study designs in epidemiologyStudy designs in epidemiology
Study designs in epidemiology
 
Introduction to Biostatistics
Introduction to BiostatisticsIntroduction to Biostatistics
Introduction to Biostatistics
 
Sir presentation
Sir presentationSir presentation
Sir presentation
 

Similar to Biostatistical methods

Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg Girls High
 
3.1 Measures of center
3.1 Measures of center3.1 Measures of center
3.1 Measures of center
Long Beach City College
 
Rj Prashant's ppts on statistics
Rj Prashant's ppts on statisticsRj Prashant's ppts on statistics
Rj Prashant's ppts on statistics
Rj Prashant Kumar Dwivedi
 
descriptive and inferential statistics
descriptive and inferential statisticsdescriptive and inferential statistics
descriptive and inferential statistics
Mona Sajid
 
Statistics -copy_-_copy[1]
Statistics  -copy_-_copy[1]Statistics  -copy_-_copy[1]
Statistics -copy_-_copy[1]
akshit123456789
 
Edited economic statistics note
Edited economic statistics noteEdited economic statistics note
Edited economic statistics note
haramaya university
 
Machine learning pre requisite
Machine learning pre requisiteMachine learning pre requisite
Machine learning pre requisite
Ram Singh
 
Statistics digital text book
Statistics digital text bookStatistics digital text book
Statistics digital text book
deepuplr
 
statistics class 11
statistics class 11statistics class 11
statistics class 11
ShivangBansal6
 
Statistics
StatisticsStatistics
Statistics
itutor
 
Chapter 4 MMW.pdf
Chapter 4 MMW.pdfChapter 4 MMW.pdf
Chapter 4 MMW.pdf
RaRaRamirez
 
Biostats in ortho
Biostats in orthoBiostats in ortho
Biostats in ortho
Raunak Manjeet
 
2.1 frequency distributions for organizing and summarizing data
2.1 frequency distributions for organizing and summarizing data2.1 frequency distributions for organizing and summarizing data
2.1 frequency distributions for organizing and summarizing data
Long Beach City College
 
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfMSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
Suchita Rawat
 
STATISTICS +1.pptx
STATISTICS +1.pptxSTATISTICS +1.pptx
STATISTICS +1.pptx
AjayPM4
 
Quantitative techniques in geography
Quantitative techniques in geographyQuantitative techniques in geography
Quantitative techniques in geography
DalbirAntil
 
Descriptive
DescriptiveDescriptive
Descriptive
Mmedsc Hahm
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
MuhammadNafees42
 
Quatitative Data Analysis
Quatitative Data Analysis Quatitative Data Analysis
Quatitative Data Analysis
maneesh mani
 
Unit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptxUnit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptx
Malla Reddy University
 

Similar to Biostatistical methods (20)

Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
 
3.1 Measures of center
3.1 Measures of center3.1 Measures of center
3.1 Measures of center
 
Rj Prashant's ppts on statistics
Rj Prashant's ppts on statisticsRj Prashant's ppts on statistics
Rj Prashant's ppts on statistics
 
descriptive and inferential statistics
descriptive and inferential statisticsdescriptive and inferential statistics
descriptive and inferential statistics
 
Statistics -copy_-_copy[1]
Statistics  -copy_-_copy[1]Statistics  -copy_-_copy[1]
Statistics -copy_-_copy[1]
 
Edited economic statistics note
Edited economic statistics noteEdited economic statistics note
Edited economic statistics note
 
Machine learning pre requisite
Machine learning pre requisiteMachine learning pre requisite
Machine learning pre requisite
 
Statistics digital text book
Statistics digital text bookStatistics digital text book
Statistics digital text book
 
statistics class 11
statistics class 11statistics class 11
statistics class 11
 
Statistics
StatisticsStatistics
Statistics
 
Chapter 4 MMW.pdf
Chapter 4 MMW.pdfChapter 4 MMW.pdf
Chapter 4 MMW.pdf
 
Biostats in ortho
Biostats in orthoBiostats in ortho
Biostats in ortho
 
2.1 frequency distributions for organizing and summarizing data
2.1 frequency distributions for organizing and summarizing data2.1 frequency distributions for organizing and summarizing data
2.1 frequency distributions for organizing and summarizing data
 
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdfMSC III_Research Methodology and Statistics_Descriptive statistics.pdf
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
 
STATISTICS +1.pptx
STATISTICS +1.pptxSTATISTICS +1.pptx
STATISTICS +1.pptx
 
Quantitative techniques in geography
Quantitative techniques in geographyQuantitative techniques in geography
Quantitative techniques in geography
 
Descriptive
DescriptiveDescriptive
Descriptive
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
 
Quatitative Data Analysis
Quatitative Data Analysis Quatitative Data Analysis
Quatitative Data Analysis
 
Unit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptxUnit 1 - Statistics (Part 1).pptx
Unit 1 - Statistics (Part 1).pptx
 

Recently uploaded

Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Building a Quantum Computer Neutral Atom.pdf
Building a Quantum Computer Neutral Atom.pdfBuilding a Quantum Computer Neutral Atom.pdf
Building a Quantum Computer Neutral Atom.pdf
cjimenez2581
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
一比一原版英属哥伦比亚大学毕业证(UBC毕业证书)学历如何办理
z6osjkqvd
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Biostatistical methods

3.2.4. Frequency Polygon:
 A frequency polygon is derived from a histogram by joining the midpoints of the tops of the rectangles with straight lines.
 Polygon means a figure with many angles.
 It is an area diagram and a graphical representation of frequency distribution.
 The X-axis is marked with class intervals.
 The Y-axis is marked with frequencies.
 The midpoints of the tops of the rectangles are joined by straight lines.
3.2.4.1. Uses of Frequency Polygon:
 It simplifies complex data.
 It gives an idea of the pattern of distribution of variables in the population.
 It facilitates comparison of two or more frequency distributions on the same graph.
 It gives a clear picture of the data.
Fig: Example of a Frequency Polygon
3.3. Cumulative Frequency Distribution: The cumulative frequency distribution is a statistical table in which the frequencies of all preceding classes are added to the frequency of each class. For example:

Class      Frequency
0 – 9      3
10 – 19    9
20 – 29    11
30 – 39    7
Total      30

Table: Continuous Frequency Distribution

Class      Frequency   Cumulative Frequency
0 – 9      3           3
10 – 19    9           12
20 – 29    11          23
30 – 39    7           30

Table: Cumulative Frequency Distribution (a short code sketch of this running total follows below)
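As a quick illustration of how the cumulative column is built, here is a minimal Python sketch using the class frequencies from the table above; itertools.accumulate simply keeps a running total:

    from itertools import accumulate

    classes = ["0 - 9", "10 - 19", "20 - 29", "30 - 39"]
    frequencies = [3, 9, 11, 7]

    # Each cumulative frequency is the running total of the frequencies so far.
    for cls, cum in zip(classes, accumulate(frequencies)):
        print(f"{cls:>7}  {cum:>3}")
    # Prints 3, 12, 23, 30; the last value equals the total frequency (30).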
4. Population:
 In biology, a population is all the organisms of the same group or species that live in a particular geographical area and are capable of interbreeding.
 The area of a sexual population is the area where interbreeding is potentially possible between any pair within the area, and where the probability of interbreeding is greater than the probability of cross-breeding with individuals from other areas.
Fig: The distribution of the human world population in 1994
5. Sampling:
 Sampling is a method of collection of data.
 A sample is a representative fraction of a population.
 When the population is very large or infinite, sampling is the suitable method for data collection.
 Example: The oxygen content of pond water can be found by titrating just 100 ml of water.
 There are two types of sampling, namely
1. Random Sampling.
2. Non-random Sampling.
5.1. Random Sampling:
 In random sampling a small group is selected from a large population without any aim or predetermination. The small group is called a sample.
 In this method each item of the population has an equal and independent chance of being included in the sample.
 A random sample may be selected by the lottery method.
5.1.1. Simple Random Sampling:
 In this method a sample is selected such that each item of the population has an equal and independent chance of being included in the sample.
 A certain number of items are chosen at random, without any predetermined basis.
5.1.2. Stratified Random Sampling:
 This sampling technique is generally recommended when the population is heterogeneous.
 In this method, the whole population is divided into strata (subgroups), each possessing similar characteristics.
 Samples are selected by taking an equal proportion of items from each group.
 Example: Suppose we want to select 100 students from a population of 1000 students consisting of 700 girls and 300 boys. The whole population is divided into two strata: 700 girls and 300 boys. Simple random sampling within each stratum then selects 70 girls and 30 boys, giving a sample of 100 students. (A code sketch of this procedure appears at the end of this section.)
5.1.3. Systematic Random Sampling:
 It is also known as quasi-random sampling.
 In this method, all the items are arranged in some spatial or temporal order.
 Example: Persons listed alphabetically in a telephone directory, or plants growing in rows in a field.
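The following minimal Python sketch illustrates the stratified example above. The labels girl_0, boy_0, etc. and the 10% sampling fraction are illustrative assumptions, not part of the original example:

    import random

    random.seed(42)  # fixed seed so the illustration is reproducible

    # Hypothetical population: 700 girls and 300 boys.
    girls = [f"girl_{i}" for i in range(700)]
    boys = [f"boy_{i}" for i in range(300)]

    def stratified_sample(strata, fraction):
        """Draw a simple random sample of the given fraction from each stratum."""
        sample = []
        for stratum in strata:
            k = round(len(stratum) * fraction)
            sample.extend(random.sample(stratum, k))
        return sample

    sample = stratified_sample([girls, boys], fraction=0.10)
    print(len(sample))  # 100 items: 70 girls + 30 boys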
Unit II Descriptive Statistics
1. Measures of Central Tendency:
 A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
 Measures of central tendency are sometimes called measures of central location.
 They are also classed as summary statistics.
1.1. Mean (Arithmetic): The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set with values x1, x2, ..., xn, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = Σx / n

1.1.1. Significance:
 One of its important properties is that it minimizes error in the prediction of any one value in your data set. That is, it is the value that produces the lowest amount of error from all other values in the data set.
 An important property of the mean is that it includes every value in your data set as part of the calculation.
 In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
1.2. Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. To calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange the data in order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when there is an odd number of scores; with an even number of scores (for example 10 scores) we simply take the middle two scores and average the result.
Example:

65 55 89 56 35 14 56 55 87 45

We again rearrange the data in order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Now we take the 5th and 6th scores in our data set and average them to get a median of 55.5.
1.3. Mode: The mode is the most frequent score in our data set; on a bar chart or histogram it corresponds to the highest bar. Normally, the mode is used for categorical data where we wish to know which is the most common category. To find the mode of ungrouped data, the values are arranged in ascending order; the value which occurs the maximum number of times is the mode.

18 21 23 23 25 25 25 27 29 29

In the above data 25 occurs the maximum number of times, so 25 is the mode. However, one problem with the mode is that it is not unique, which creates difficulties when two or more values share the highest frequency. (A code sketch computing all three measures follows below.)
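As a quick check of these definitions, here is a minimal Python sketch using the standard library's statistics module and the example scores above:

    import statistics

    scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

    print(statistics.mean(scores))    # 59.0, the arithmetic mean
    print(statistics.median(scores))  # 56, the middle value of the sorted list

    even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]
    print(statistics.median(even_scores))  # 55.5, average of the 5th and 6th values

    print(statistics.mode([18, 21, 23, 23, 25, 25, 25, 27, 29, 29]))  # 25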
The mode is a positional average and a measure of central value. When data has one concentration of frequency, it is called unimodal; with more than one concentration it is called bimodal (for two concentrations), trimodal (for three concentrations), and so on.
1.3.1. Significance:
 No mathematical calculation is needed.
 The mode can be found easily.
 However, it is less reliable than the other measures of central tendency.
1.4. Range:
 Range is the difference between the lowest value and the highest value of a set of data.
 Range = Largest value (Xm) – Smallest value (X0)
1.4.1. Coefficient of Range: This is a relative measure of dispersion based on the value of the range. It is also called the range coefficient of dispersion. It is defined as:

Coefficient of Range = (Xm – X0) / (Xm + X0)
2.1. Variance: Variance is the average of the squared differences from the mean. Steps involved in calculating the variance:
i. Calculate the mean.
ii. Subtract the mean from each value.
iii. Square the result.
iv. Add the squared numbers.
v. Take the average of the squared results.
2.2. Standard Deviation:
 Standard deviation is a measure of dispersion.
 The standard deviation is a measure of how spread out numbers are.
 Its symbol is SD or σ (the Greek letter sigma).
 The formula is simple: it is the square root of the variance.
Example: The heights of five dogs (at the shoulders) are 600 mm, 470 mm, 170 mm, 430 mm and 300 mm. Find the mean, the variance, and the standard deviation.
The first step is to find the mean:

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

To calculate the variance, take each difference from the mean, square it, and then average the results:

Variance σ² = { 206² + 76² + (−224)² + 36² + (−94)² } / 5
            = (42436 + 5776 + 50176 + 1296 + 8836) / 5
            = 108520 / 5
            = 21704

So the variance is 21,704, and the standard deviation is just the square root of the variance:

Standard Deviation (σ) = √21704 = 147.32... ≈ 147 (to the nearest mm)

In other words, the square of the standard deviation is the variance.
2.3. Coefficient of Variation (CV): The coefficient of variation is the standard deviation expressed as a percentage of the mean. It is a relative measure of dispersion.

Coefficient of Variation = (SD / X) × 100

Where, SD = Standard Deviation, X = Mean. (A short code check of these calculations follows below.)
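A minimal Python check of the dog-height example; note it uses the population variance (dividing by n, as the worked example does) rather than the sample variance's n − 1:

    import math

    heights = [600, 470, 170, 430, 300]  # dog heights in mm

    mean = sum(heights) / len(heights)  # 394.0
    # Population variance: the average of the squared deviations from the mean.
    variance = sum((h - mean) ** 2 for h in heights) / len(heights)  # 21704.0
    sd = math.sqrt(variance)  # 147.32...
    cv = sd / mean * 100      # coefficient of variation, about 37.4%

    print(mean, variance, round(sd), round(cv, 1))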
2.4. Grouped Data:
 Grouped data is data that has been organized into groups known as classes.
 Grouped data has been 'classified', and thus some level of data analysis has taken place, which means that the data is no longer raw.
 A data class is a group of data related by some user-defined property. For example, when collecting the ages of people we could group them into classes such as those in their teens, twenties, thirties, forties and so on. Each of those groups is called a class.
 Each class has a certain width, referred to as the class interval or class size.
 The class interval is very important when it comes to drawing histograms and frequency diagrams. All the classes may have the same or different class sizes.
Below is an example of grouped data where the classes have the same class interval.

Age (years)   Frequency
0 - 9         12
10 - 19       30
20 - 29       18
30 - 39       12
40 - 49       9
50 - 59       6
60 - 69       0

Below is an example of grouped data where the classes have different class intervals.

Age (years)   Frequency   Class Interval
0 - 9         15          10
10 - 19       18          10
20 - 29       17          10
30 - 49       35          20
50 - 79       20          30

2.5. Graphical Methods: These methods are applied to visually describe data from a sample or population. Graphs provide visual summaries of data that describe the essential information more quickly and completely than tables of numbers.
There are many types of graphical representation.
2.5.1. The Bar Chart: To construct a bar chart,
 Place categories on the horizontal axis.
 Place frequency (or relative frequency) on the vertical axis.
 Construct vertical bars of equal width, one for each category.
 Make the height of each bar proportional to the frequency (or relative frequency) of its category.
Fig: Example of a Bar Chart
2.5.2. The Pie Chart: To draw a pie chart,
 Make a complete circle that represents the total number of measurements, and partition it into slices, one for each category.
 The size of a slice is proportional to the relative frequency of that category.
 Determine the angle of each slice by multiplying the relative frequency by 360 degrees. (A small code sketch of this calculation follows below.)
Fig: Example of a Pie Chart showing use of different web browsers
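A minimal Python sketch of the slice-angle calculation; the browser names and counts here are illustrative assumptions, not values from the original figure:

    # Hypothetical counts of users per web browser.
    counts = {"Chrome": 60, "Firefox": 25, "Edge": 10, "Other": 5}

    total = sum(counts.values())
    for category, n in counts.items():
        relative_frequency = n / total
        angle = relative_frequency * 360  # degrees of this pie slice
        print(f"{category}: {angle:.1f} degrees")
    # The angles sum to 360 degrees, completing the circle.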
2.5.3. Histogram: A histogram is a graph presenting frequencies in the form of vertical rectangles. It is an area diagram.
 It is a graphical presentation of a frequency distribution.
 The X-axis is marked with class intervals.
 The Y-axis is marked with frequencies.
 Vertical rectangles are drawn with heights matching the frequency of each class. The rectangles are drawn without any gap in between.
 A histogram is a two-dimensional diagram.
Fig: Example of a histogram
2.5.4. Quantile Plots: These visually portray the quantiles, or percentiles (which equal the quantiles times 100), of the distribution of sample data. Quantiles of importance, such as the median, are easily discerned (quantile, or cumulative frequency, = 0.5). The main benefits of quantile plots are as follows:
i. Arbitrary categories are not required, as with histograms or stem-and-leaf plots.
ii. All of the data are displayed, unlike a box plot.
iii. Every point has a distinct position, without overlap.
Fig: Example of a Quantile Plot
2.5.5. Box Plot:
 In descriptive statistics, a box plot is a method for graphically depicting groups of numerical data through their quartiles.
 Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.
 Outliers may be plotted as individual points.
 Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.
Fig: Example of a Box Plot
2.5.6. Benefits of Graphical Representation:
1. Acceptability: A graphical report is acceptable to busy people because it highlights the theme of the report at a glance. This helps to avoid wastage of time.
2. Comparative Analysis: Information can be compared using graphical representation. Such comparative analysis supports quick understanding and attention.
3. Less cost: Descriptive information takes considerable time to present properly and more money to print, but a graphical presentation can convey the report in a short but catchy view, and so involves less cost.
4. Decision Making: Business executives can view graphs at a glance and make decisions very quickly, which is hardly possible with a descriptive report.
5. Logical Ideas: If tables, designs and graphs are used to represent information, a logical sequence is created that clarifies the idea for the audience.
6. Helpful for a less educated Audience: Less literate or illiterate people can understand graphical representation easily because it does not involve going through a descriptive report line by line.
7. Less Effort and Time: Presenting any table, design, image or graph requires less effort and time, and such presentation enables quick understanding of the information.
8. Less Error and Mistakes: Qualitative, informative or descriptive reports involve errors and mistakes. As graphical representations are exhibited through numerical figures, tables or graphs, they usually involve fewer errors and mistakes.
9. A complete Idea: Such representation creates a clear and complete idea in the mind of the audience. Reading a hundred pages may not give any scope to make a decision, but an instant view at a glance makes an impression on the audience regarding the topic or subject.
10. Use on the Notice Board: Such a representation can be hung on the notice board to quickly attract the attention of employees in any organization.
2.5.7. Graphical representation also has some drawbacks:
1. Expensive: Graphical representations of reports are costly because they involve images, colors and paints. The combination of materials with human effort makes graphical presentation expensive.
2. More time: Graphical representation takes more time, as it requires graphs and figures which themselves take time to prepare.
3. Errors and Mistakes: Since graphical representations are complex, there is a chance of errors and mistakes. This causes problems of understanding for general readers.
4. Lack of Privacy: Graphical representation makes a full presentation of the information, which may defeat the objective of keeping something secret.
5. Problems selecting the appropriate method: Information can be presented through various graphical methods and ways; deciding which method is most suitable is very hard.
6. Problem of Understanding: Not everyone can understand the meaning of a graphical representation, because it involves various technical matters which are complex to general people.
2.6. Obtaining Descriptive Statistics on a Computer (MS Excel):
Suppose we have the scores of 14 participants for a test. To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: if you can't find the Data Analysis button, load the Analysis ToolPak add-in first.
2. Select Descriptive Statistics and click OK.
3. Select the range A2:A15 as the Input Range.
4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.
6. Click OK.
Result: The summary statistics for the scores appear starting at the output cell selected in step 4.
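For readers working outside Excel, a rough equivalent of the ToolPak's summary output can be sketched in Python. The scores below are hypothetical stand-ins for cells A2:A15, since the original worksheet values are not given:

    import statistics

    # Hypothetical test scores for 14 participants (stand-ins for A2:A15).
    scores = [55, 61, 72, 48, 90, 66, 71, 58, 84, 77, 69, 52, 63, 80]

    summary = {
        "Mean": statistics.mean(scores),
        "Median": statistics.median(scores),
        "Standard Deviation": statistics.stdev(scores),  # sample SD, as Excel reports
        "Sample Variance": statistics.variance(scores),
        "Minimum": min(scores),
        "Maximum": max(scores),
        "Sum": sum(scores),
        "Count": len(scores),
    }
    for name, value in summary.items():
        print(f"{name}: {value}")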
3. Case Study: In the social sciences and life sciences, a case study is a research method involving an up-close, in-depth, and detailed examination of a subject of study (the case), as well as its related contextual conditions. Case studies can be produced by following a formal research method. Such case studies are likely to appear in formal research venues, such as journals and professional conferences, rather than popular works. The resulting body of 'case study research' has long had a prominent place in many disciplines and professions, ranging from psychology, anthropology, sociology, and political science to education, clinical science, social work, and administrative science.
3.1. Types of Case Studies: Under the more generalized category of case study exist several subdivisions, each of which is custom selected for use depending upon the goals of the investigator. These types of case study include the following:
 Illustrative case studies: These are primarily descriptive studies. They typically utilize one or two instances of an event to show the existing situation. Illustrative case studies serve primarily to make the unfamiliar familiar and to give readers a common language about the topic in question.
 Exploratory (or pilot) case studies: These are condensed case studies performed before implementing a large-scale investigation. Their basic function is to help identify questions and select types of measurement prior to the main investigation. The primary pitfall of this
type of study is that initial findings may seem convincing enough to be released prematurely as conclusions.
 Cumulative case studies: These serve to aggregate information from several sites collected at different times. The idea behind these studies is that the collection of past studies will allow for greater generalization without additional cost or time being expended on new, possibly repetitive studies.
 Critical instance case studies: These examine one or more sites either for the purpose of examining a situation of unique interest with little to no interest in generalization, or to call into question a highly generalized or universal assertion. This method is useful for answering cause-and-effect questions.
Unit III: Probability and Distribution
1. Probability: Probability is the proportion of times an event occurs in a set of trials. The word 'probability' means chance: how likely an event is to happen. Probability is calculated by the following formula:

P = e / t

Where, P = Probability
e = number of times an event occurs (frequency)
t = total number of trials or items

The probability value is always a fraction falling between 0 and 1.
Example: When a die numbered from 1 to 6 is tossed, the total number of possible outcomes is 6. The probability of any one number is 1/6 = 0.17.
Here p is the probability of an event occurring and q is the probability of the event not occurring, so when one is known the other can be calculated:

q = 1 – p
p = 1 – q

1.1. Laws of Probability: There are two types of theorems of probability, namely
1. Addition theorems.
2. Multiplication theorems.
1.1.1. Addition Theorem:
 The probability that one of several mutually exclusive events occurs is the sum of the individual probabilities of the events.
 Mutually exclusive events cannot occur simultaneously.
 The occurrence of one event prevents the occurrence of the other events.
 Example: In a coin-tossing experiment, the occurrence of a head excludes the occurrence of a tail.
If the event of getting a head is A and that of getting a tail is B, then the probability of getting either a head or a tail in one toss is

P(A or B) = p(A) + p(B)

For a fair coin this is 1/2 + 1/2 = 1, since head and tail are the only possible outcomes.
1.1.2. Multiplication Theorem:
 The probability of the joint occurrence of two independent events is the product of their individual probabilities.
 For independent events, the probability is calculated by multiplication.
 An independent event does not affect the occurrence of the other event: when two coins are tossed, the result of the first coin does not affect the second coin.
 Example: For two independent events A and B,

P(A and B) = P(A) × P(B)

2. Random Events: Experiments whose outcome is not known in advance are called random experiments. For example, when we toss a coin, we do not know if it will land heads up or tails up; hence tossing a coin is a random experiment. Another example is the result of an interview or examination. When we speak about random experiments, we have to know what the sample space is. The sample space, denoted by S, is the set of all possible outcomes of a random experiment.
Example: Consider the random experiment of tossing a die and let us write down its sample space S, the set of all possible outcomes. A die has 6 faces numbered 1, 2, 3, 4, 5, 6, and when we toss it once only one of the faces will turn up. Hence the sample space is

S = {1, 2, 3, 4, 5, 6}

Consider one more simple example, tossing two coins. The sample space is

S = {(H,H), (H,T), (T,H), (T,T)}
Here, H: head; T: tail.
3. Exhaustive Events: Two or more events are said to be exhaustive if, taken together, at least one of them is certain to occur. Exhaustive events can be either elementary or compound.
Example: Consider the experiment of a fair die being thrown. There are six outcomes, all equally likely to occur. The events of getting the different numbers, taken together, are exhaustive, because on any throw we are sure to get one of the numbers.
4. Mutually Exclusive Events: Mutually exclusive events cannot occur together simultaneously. The occurrence of one event prevents the occurrence of the other event. Mutually exclusive events are connected by the words 'either/or'. Example: head and tail of a coin.
5. Equally Likely Events: Equally likely events have equal chances of occurrence. Example: winning or losing in a game; head or tail of a coin.
6. Binomial Distribution: A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix "bi" means two, or twice). Examples of binomial distribution problems:
 The number of defective/non-defective products in a production run.
 Yes/No surveys (such as asking 150 people if they watch ABC news).
 Vote counts for a candidate in an election.
 The number of successful sales calls.
 The number of male/female workers in a company.
6.1. Criteria of Binomial Distributions:
i. The number of observations or trials is fixed. In other words, you can only work out the probability of something happening if you do it a fixed number of times. This is common sense: if you toss a coin once, your probability of getting a tail is 50%; if you toss a coin 20 times, your probability of getting at least one tail is very, very close to 100%.
ii. Each observation or trial is independent. In other words, none of your trials has an effect on the probability of the next trial.
iii. The probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.
6.2. Formula: Notation for the binomial distribution and its probability mass formula:

P(x) = nCx · p^x · q^(n−x)

Where:
 p is the probability of success on any trial.
 q = 1 − p is the probability of failure.
 n is the number of trials/experiments.
 x is the number of successes; it can take the values 0, 1, 2, 3, ..., n.
 nCx = n! / (x!(n−x)!) denotes the number of combinations of n elements taken x at a time.
Problem: A box of candies has many different colors in it. There is a 15% chance of getting a pink candy. What is the probability that exactly 4 out of 10 candies in a box are pink? We have n = 10, p = 0.15, q = 0.85, x = 4. Substituting into the formula:

P(4) = 10C4 × (0.15)^4 × (0.85)^6 = 210 × 0.00050625 × 0.37715 ≈ 0.0401

Interpretation: The probability that exactly 4 candies in a box are pink is about 0.04. (A code check of this calculation follows below.)
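A minimal Python check of the candy example, computing the binomial probability mass directly with math.comb:

    from math import comb

    def binomial_pmf(x, n, p):
        """P(exactly x successes in n independent trials, success probability p)."""
        return comb(n, x) * p**x * (1 - p) ** (n - x)

    # Probability that exactly 4 of 10 candies are pink when p = 0.15.
    print(round(binomial_pmf(4, 10, 0.15), 4))  # 0.0401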
6.3. Properties of the Binomial Distribution:
1. The binomial distribution is applicable when the trials are independent and each trial has just two outcomes, success and failure. It is applied in coin-tossing experiments, sampling inspection plans, genetic experiments and so on.
2. The binomial distribution is known as a bi-parametric distribution, as it is characterized by two parameters, n and p. This means that if the values of n and p are known, the distribution is known completely.
3. The mean of the binomial distribution is given by μ = np.
4. Depending on the values of the two parameters, the binomial distribution may be unimodal or bimodal. To find the mode of a binomial distribution, first find the value of (n+1)p:
 If (n+1)p is not an integer, the distribution is unimodal, and the mode is the largest integer contained in (n+1)p.
 If (n+1)p is an integer, the distribution is bimodal, and the modes are (n+1)p and (n+1)p − 1.
5. The variance of the binomial distribution is given by σ² = npq.
6. Since p and q are numerically less than or equal to 1, npq < np; that is, the variance of a binomial variable is always less than its mean.
7. The variance of a binomial variable X attains its maximum value at p = q = 0.5, and this maximum value is n/4.
8. Additive property of the binomial distribution: let X and Y be two independent binomial variables, X with parameters n₁ and p, and Y with parameters n₂ and p. Then (X + Y) is also a binomial variable, with parameters (n₁ + n₂) and p.
7. Poisson Distribution:
 The Poisson distribution was devised by Poisson in 1837.
 It is a discrete frequency distribution.
 The Poisson distribution describes the occurrence of rare events; hence it is called the law of improbable events.
 When the probability of the event is very rare in a large number of trials, the resulting distribution is called a Poisson distribution.
 Example: the number of deaths due to heart attack in a hospital or a town.
7.1. Properties of the Poisson Distribution:
 The probability of success of the event (p) is very small and approaches zero.
 The probability of failure of the event (q) is very high and almost equal to 1, and n is large.
 The Poisson distribution has a single parameter, the mean, denoted by m: m = np = constant.
 The formula used for the Poisson distribution is as follows:

Probability of r successes: P(r) = (e^−m · m^r) / r!

Where, r = 0, 1, 2, 3, ..., n successes
e = 2.7183 (constant)

 The SD (standard deviation) of the Poisson distribution is √m = √np.
 Variance = SD² = m = np.
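A minimal Python sketch of this mass function; the mean m = 2 used here is an illustrative assumption, not a value from the text:

    from math import exp, factorial

    def poisson_pmf(r, m):
        """P(r events) for a Poisson distribution with mean m."""
        return exp(-m) * m**r / factorial(r)

    m = 2  # hypothetical mean number of rare events per interval
    for r in range(5):
        print(r, round(poisson_pmf(r, m), 4))
    # Probabilities: 0.1353, 0.2707, 0.2707, 0.1804, 0.0902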
8. Normal Distribution: The normal distribution is a continuous probability distribution. In this distribution the values are clustered closely around the centre, and the frequencies decrease towards the left and right.
Example: The height of students in a class is a typical example of a normal distribution. The height of most students will be between 150 cm and 170 cm; the height of only a few students will be less than 150 cm, and the height of only a few students will be above 170 cm. Thus there is an increasing number of students towards the middle point and a decreasing number towards the ends.
8.1. Properties of the Normal Distribution:
 The graph obtained for a normal distribution is called the normal distribution curve.
 The normal distribution curve is obtained when the values are plotted on the X-axis and the number of individuals (frequency) on the Y-axis.
 The normal distribution curve is symmetrical. It is bell-shaped.
Fig: Example of a Normal Distribution Curve
 The normal distribution curve is also called the Gaussian curve, named after its discoverer Carl Gauss.
 The normal distribution curve is a continuous distribution. It is associated with height, weight, age, rate of respiration, etc.
 It has only one maximum peak; hence it is a unimodal curve.
 The height of the normal curve is maximum at its mean.
 Mean, median and mode are equal for a normal distribution: Mean = Median = Mode.
 Most of the values are clustered around the mean, and there are relatively few observations at the extreme ends.
 The normal curve never touches the horizontal axis; it approaches it asymptotically.
 The mean deviation is about 4/5 of the standard deviation (more precisely, 0.7979 times the SD).
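To make the clustering around the mean concrete, here is a minimal Python sketch using the normal cumulative distribution function via math.erf; the mean of 160 cm and SD of 5 cm for the height example are illustrative assumptions:

    from math import erf, sqrt

    def normal_cdf(x, mu, sigma):
        """P(X <= x) for a normal distribution with mean mu and SD sigma."""
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    mu, sigma = 160, 5  # hypothetical mean and SD of student heights, in cm

    # Probability a height falls between 150 cm and 170 cm (within 2 SD of the mean).
    p = normal_cdf(170, mu, sigma) - normal_cdf(150, mu, sigma)
    print(round(p, 4))  # about 0.9545, i.e. most students, as the example suggests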
Unit IV: Correlation and Regression Analysis
1. Correlation:
 Correlation, in the finance and investment industries, is a statistic that measures the degree to which two securities move in relation to each other.
 Correlations are used in advanced portfolio management, computed as the correlation coefficient, which has a value that must fall between −1.0 and +1.0.
 More generally, correlation is a statistic that measures the degree to which two variables move in relation to each other.
1.1. Definitions of Correlation:
 According to Taro Yamane, "Correlation analysis is a discussion of the degree of closeness of the relationship between two variables."
 According to Ya Lun Chou, "Correlation analysis attempts to determine the degree of relationship between variables."
 According to Prof. Bodding, "Wherever some definite connection exists between 2 or more groups, classes or series of data, there is said to be a correlation."
 A very simple definition is given by A. M. Tuttle: "An analysis of the co-variation of two or more variables is usually called correlation."
1.2. The Formula for Correlation: The standard (Pearson) formula is

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

Correlation measures association, but it does not tell you whether x causes y or vice versa, or whether the association is caused by some third (perhaps unseen) factor.
1.3. Positive Correlation: A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one security moves, either up or down, the other security moves in lockstep, in the same direction.
1.4. Negative Correlation: A perfect negative correlation means that two assets move in opposite directions, while a zero correlation implies no relationship at all.
1.5. Calculation of Correlation (Karl Pearson's Coefficient of Correlation): Karl Pearson, a great biometrician and statistician, suggested a mathematical method for measuring the magnitude of the linear relationship between two variables. Karl Pearson's method is the most widely used in practice and is known as the Pearsonian coefficient of correlation. It is denoted by the symbol "r". The simplest formula is

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

The value of the coefficient of correlation always lies between +1 and −1. When r = +1, there is a perfect positive correlation between the two variables; when r = −1, there is a perfect negative correlation; and when r = 0, there is no relationship or correlation between the two variables. Theoretically we get values anywhere between +1 and −1, but in practice the value usually lies strictly in between, for example around +0.8 or −0.5.
1.6. Problem: Find the coefficient of correlation between the age of husbands (X) and the age of wives (Y).

X: 23 27 28 28 29 30 31 33 35 36
Y: 18 20 22 27 21 29 27 29 28 29

Solution: The worked solution, shown as a table in the original slides, takes deviations from the means x̄ = 30 and ȳ = 25 and obtains Σ(x − x̄)(y − ȳ) = 123, Σ(x − x̄)² = 138 and Σ(y − ȳ)² = 164, so that r = 123 / √(138 × 164) ≈ 0.82, a high positive correlation. (A code check follows below.)
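A minimal Python check of the husband/wife ages problem, computing Pearson's r from the deviation sums:

    from math import sqrt

    X = [23, 27, 28, 28, 29, 30, 31, 33, 35, 36]  # ages of husbands
    Y = [18, 20, 22, 27, 21, 29, 27, 29, 28, 29]  # ages of wives

    mx, my = sum(X) / len(X), sum(Y) / len(Y)  # 30.0 and 25.0

    sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 123.0
    sxx = sum((x - mx) ** 2 for x in X)                   # 138.0
    syy = sum((y - my) ** 2 for y in Y)                   # 164.0

    r = sxy / sqrt(sxx * syy)
    print(round(r, 2))  # 0.82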
2. Covariance:
 In probability theory and statistics, covariance is a measure of the joint variability of two random variables.
 If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values (i.e., the variables tend to show similar behavior), the covariance is positive.
 In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative.
 The sign of the covariance therefore shows the tendency in the linear relationship between the variables.
2.1. The Covariance Formula: The (sample) covariance formula is:

Cov(X,Y) = Σ (X − μ)(Y − ν) / (n − 1)

where:
X is a random variable
E(X) = μ is the expected value (the mean) of the random variable X, and E(Y) = ν is the expected value (the mean) of the random variable Y
n = the number of items in the data set

Example: Calculate the covariance for the following data set:
X: 2.1, 2.5, 3.6, 4.0 (mean = 3.05)
Y: 8, 10, 12, 14 (mean = 11)
Substitute the values into the formula and solve:

Cov(X,Y) = [ (2.1−3.05)(8−11) + (2.5−3.05)(10−11) + (3.6−3.05)(12−11) + (4.0−3.05)(14−11) ] / (4−1)
         = (2.85 + 0.55 + 0.55 + 2.85) / 3
         = 6.8 / 3
         = 2.267

The result is positive, meaning that the variables are positively related.
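A minimal Python check of the covariance example above (sample covariance, dividing by n − 1):

    X = [2.1, 2.5, 3.6, 4.0]
    Y = [8, 10, 12, 14]

    mx, my = sum(X) / len(X), sum(Y) / len(Y)  # 3.05 and 11.0

    # Sample covariance: average cross-product of deviations, divided by n - 1.
    cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (len(X) - 1)
    print(round(cov, 3))  # 2.267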
2.2. Covariance in Excel: Overview: Covariance gives you a positive number if the variables are positively related, and a negative number if they are negatively related. A high covariance indicates a strong relationship between the variables; a low value indicates a weak relationship.
Covariance in Excel (steps):
Step 1: Enter your data into two columns in Excel. For example, type your X values into column A and your Y values into column B.
Step 2: Click the "Data" tab and then click "Data Analysis." The Data Analysis window will open.
Step 3: Choose "Covariance" and then click "OK."
Step 4: Click "Input Range" and then select all of your data. Include column headers if you have them.
Step 5: Click the "Labels in First Row" check box if you have included column headers in your data selection.
Step 6: Select "Output Range" and then select an area on the worksheet. A good place to select is an area just to the right of your data set.
Step 7: Click "OK." The covariance will appear in the area you selected in Step 6.
3. Scatter Diagram: A scatter diagram is a graph that shows the relationship between two variables. Scatter diagrams can demonstrate a relationship between any element of a process, environment, or activity on one axis and a quality defect on the other axis.
3.1. Types of Scatter Diagram: According to the type of correlation, scatter diagrams can be divided into the following categories:
 Scatter Diagram with No Correlation
 Scatter Diagram with Moderate Correlation
 Scatter Diagram with Strong Correlation
3.1.1. Scatter Diagram with No Correlation: This type of diagram is also known as a "Scatter Diagram with Zero Degree of Correlation". In this type of scatter diagram, the data points are spread so randomly that you cannot draw any line through them. In this case you can say that there is no relation between the two variables.
3.1.2. Scatter Diagram with Moderate Correlation: This type of diagram is also known as a "Scatter Diagram with Low Degree of Correlation".
Here the data points are a little closer together, and you can see that some kind of relation exists between the two variables.
3.1.3. Scatter Diagram with Strong Correlation: This type of diagram is also known as a "Scatter Diagram with High Degree of Correlation". In this diagram the data points are grouped very close to each other, such that you can draw a line following their pattern. In this case you can say that the variables are closely related to each other.
As discussed earlier, we can also divide scatter diagrams according to the slope, or trend, of the data points:
 Scatter Diagram with Strong Positive Correlation
 Scatter Diagram with Weak Positive Correlation
 Scatter Diagram with Strong Negative Correlation
 Scatter Diagram with Weak Negative Correlation
 Scatter Diagram with Weakest (or no) Correlation
A strong positive correlation means there is a clearly visible upward trend from left to right; a strong negative correlation means there is a clearly visible downward trend from left to right. A weak correlation means the trend, up or down, is less clear. A flat line from left to right is the weakest correlation, as it is neither positive nor negative and indicates that the independent variable does not affect the dependent variable.
3.1.4. Scatter Diagram with Strong Positive Correlation: This type of diagram is also known as a Scatter Diagram with Positive Slant. With a positive slant, the correlation is positive, i.e. as the value of x increases, the value of y also increases. You can say that the slope of a straight line drawn along the data points will go up; the pattern closely resembles a straight line. For example, if the temperature goes up, cold drink sales will also go up.
3.1.5. Scatter Diagram with Weak Positive Correlation: Here, as the value of x increases the value of y also tends to increase, but the pattern does not closely resemble a straight line.
3.1.6. Scatter Diagram with Strong Negative Correlation: This type of diagram is also known as a Scatter Diagram with Negative Slant. With a negative slant, the correlation is negative, i.e. as the value of x increases, the value of y decreases. The slope of a straight line drawn along the data points will go down. For example, if the temperature goes up, sales of winter coats go down.
3.1.7. Scatter Diagram with Weak Negative Correlation: Here, as the value of x increases the value of y tends to decrease, but the pattern is not as well defined.
4. Dot Diagram:
 A dot diagram or dot plot is a statistical chart consisting of data points plotted on a fairly simple scale, typically using filled-in circles.
 The dot plot, as a representation of a distribution, consists of a group of data points plotted on a simple scale.
 Dot plots are used for continuous, quantitative, univariate data.
 Data points may be labelled if there are few of them.
 Dot plots are among the simplest statistical plots and are suitable for small to moderately sized data sets.
 They are useful for highlighting clusters and gaps, as well as outliers.
 Their other advantage is the conservation of numerical information.
5. General Concept of Regression:
 Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.
 Estimation using regression is called regression analysis.
 In regression analysis two variables are involved: one is called the dependent variable and the other the independent variable. E.g. the yield of rice and rainfall are related: yield of rice is the dependent variable and rainfall is the independent variable.
5.1. Definitions:
 "Regression analysis attempts to establish the nature of the relationship between variables, that is, to study the functional relationship between the variables and thereby provide a mechanism for predicting or forecasting." – Ya-Lun-Chow.
 "Regression is the measure of the average relationship between two or more variables in terms of the original units of the data." – Blair.
5.2. Regression Lines:
 The graphic representation of regression is called a regression line.
 One variable is represented as X and the other as Y.
 In simple linear regression, two regression lines can be constructed for the relationship between two variables, say X and Y.
 One regression line shows the regression of X upon Y, and the other shows the regression of Y upon X.
 When there is perfectly positive correlation (r = +1) or perfectly negative correlation (r = −1), the two regression lines coincide, i.e., there is only one line.
 The nearer the regression lines are to each other, the higher the degree of correlation.
 The farther the two lines are from each other, the lesser the degree of correlation.
 If r = 0, the variables are independent; there is no correlation, and the two lines cut each other at right angles.
5.3. Regression Coefficient and its Properties:
5.3.1. Level-level model: The basic form of linear regression (without the residuals) is

y = a + b·x

In the formula, y denotes the dependent variable and x is the independent variable. For simplicity let's assume that it is a univariate regression, but the principles obviously hold for the multivariate case as well. To put it into perspective, suppose that after fitting the model we receive (the intercept 3 and coefficient 5 below are read off the interpretations that follow):

y = 3 + 5·x

Intercept (a):
 If x is continuous and centered (by subtracting the mean of x from each observation, the average of the transformed x becomes 0): average y is 3 when x is equal to the sample mean.
 If x is continuous but not centered: average y is 3 when x = 0.
 If x is categorical: average y is 3 when x = 0 (this time indicating a category; more on this below).
Coefficient (b):
 If x is a continuous variable. Interpretation: a unit increase in x results in an increase in average y by 5 units, all other variables held constant.
 If x is a categorical variable:
This requires a bit more explanation. Let's say that x describes gender and can take the values ('male', 'female'). Now let's convert it into a dummy variable which takes the value 0 for males and 1 for females. Interpretation: average y is higher by 5 units for females than for males, all other variables held constant.
5.3.2. Log-level model:

log(y) = a + b·x

Here log denotes the natural logarithm. Typically we use a log transformation to pull outlying data from a positively skewed distribution closer to the bulk of the data, in order to make the variable normally distributed. In the case of linear regression, one additional benefit of using the log transformation is interpretability.
Fig: Example of a log transformation (before and after)
As before, suppose that the fitted model is (the values 3 and 0.01 are read off the interpretations that follow):

log(y) = 3 + 0.01·x
Intercept (a): The interpretation is similar to the vanilla (level-level) case; however, we need to take the exponent of the intercept: exp(3) = 20.09. The difference is that this value stands for the geometric mean of y (as opposed to the arithmetic mean in the case of the level-level model).
Coefficient (b): The principles are again similar to the level-level model when it comes to interpreting categorical/numeric variables. Analogously to the intercept, we need to take the exponent of the coefficient: exp(b) = exp(0.01) = 1.01. This means that a unit increase in x causes a 1% increase in average (geometric) y, all other variables held constant. Two things are worth mentioning here:
 There is a rule of thumb for interpreting coefficients of such a model. If abs(b) < 0.15, it is quite safe to say that, for instance, when b = 0.1 we will observe a 10% increase in y for a unit change in x. For coefficients with larger absolute values, it is recommended to calculate the exponent.
 When dealing with variables in the [0, 1] range (like a percentage), it is more convenient for interpretation to first multiply the variable by 100 and then fit the model. This way the interpretation is more intuitive, as we increase the variable by 1 percentage point instead of 100 percentage points (from 0 to 1 immediately).
5.3.3. Level-log model: Suppose that after fitting the model we receive (b = 5 is read off the interpretation that follows):

y = 3 + 5·log(x)

The interpretation of the intercept is the same as in the case of the level-level model.
For the coefficient b: a 1% increase in x results in an approximate increase in average y of b/100 (0.05 in this case), all other variables held constant. To get the exact amount, we would need to take b × log(1.01), which in this case gives 0.0498.
5.3.4. Log-log model: Suppose that after fitting the model we receive:

log(y) = 3 + 5·log(x)

Once again, focus on the interpretation of b: an increase in x by 1% results in approximately a 5% increase in average (geometric) y, all other variables held constant. To obtain the exact amount, we need to take 1.01^b = 1.01^5 ≈ 1.051, i.e. about a 5.1% increase. (A code check of these interpretation figures follows below.)
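A minimal Python sketch verifying the interpretation arithmetic quoted in the log-model sections above:

    from math import exp, log

    print(round(exp(3), 2))         # 20.09: geometric-mean intercept in the log-level model
    print(round(exp(0.01), 2))      # 1.01: a unit increase in x scales geometric y by about 1%
    print(round(5 * log(1.01), 4))  # 0.0498: exact level-log effect of a 1% increase in x
    print(round(1.01 ** 5, 3))      # 1.051: exact log-log effect, about a 5.1% increase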
Standard error is abbreviated as SE. It is given in the same unit as the data.
6.1. Uses of Standard Error:
 It helps to understand the difference between two samples.
 It helps to calculate the size of the sample.
 It helps to determine whether the sample is drawn from a known population or not.
A minimal computation is sketched below.
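The following short Python sketch computes the standard error of the mean directly from a sample; the sample values are made up for demonstration:

# A minimal sketch: standard error of the mean, SE = SD / sqrt(N).
import math
import statistics

sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.7, 5.0]   # hypothetical data
sd = statistics.stdev(sample)       # sample standard deviation
se = sd / math.sqrt(len(sample))    # standard error of the mean
print(f"mean = {statistics.mean(sample):.2f}, SD = {sd:.3f}, SE = {se:.3f}")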
Unit V: Statistical Hypothesis Testing
1. Making Assumptions:
 Statistical hypothesis testing requires several assumptions. These include:
 the level of measurement of the variable;
 the method of sampling;
 the shape of the population distribution;
 the sample size.
 The specific assumptions may vary depending on the test or the conditions of testing. However, without exception, all statistical tests assume random sampling.
 For example, based on our data, we can test the hypothesis that the average price of gas in California is higher than the average national price of gas. The test we are considering meets these conditions:
 The sample of California gas stations was randomly selected.
 The variable price per gallon is measured at the interval-ratio level.
 We cannot assume that the population is normally distributed.
2. Statistical Hypotheses:
 A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true.
 Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses.
 The best way to determine whether a statistical hypothesis is true would be to examine the entire population. Since that is often impractical, researchers typically examine a random sample from the population.
 If the sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses:
2.1. Null hypothesis: The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.
2.2. Alternative hypothesis: The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some non-random cause.
For example, suppose we wanted to determine whether a coin was fair and balanced. A null hypothesis might be that half the flips would result in Heads and half in Tails. The alternative hypothesis might be that the numbers of Heads and Tails would be very different. Symbolically, these hypotheses would be expressed as:
H0: P = 0.5
Ha: P ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given this result, we would be inclined to reject the null hypothesis. We would conclude, based on the evidence, that the coin was probably not fair and balanced. (A short sketch of this test appears after the steps below.)
3. Hypothesis Tests:
Statisticians follow a formal process to determine whether to reject a null hypothesis, based on sample data. This process, called hypothesis testing, consists of four steps:
 State the hypotheses. This involves stating the null and alternative hypotheses. The hypotheses are stated in such a way that they are mutually exclusive: if one is true, the other must be false.
 Formulate an analysis plan. The analysis plan describes how to use sample data to evaluate the null hypothesis. The evaluation often focuses on a single test statistic.
 Analyze sample data. Find the value of the test statistic (mean score, proportion, t statistic, z-score, etc.) described in the analysis plan.
 Interpret results. Apply the decision rule described in the analysis plan. If the value of the test statistic is unlikely under the null hypothesis, reject the null hypothesis.
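As a rough illustration of these steps, the following sketch runs an exact two-sided binomial test of the coin example (40 Heads in 50 flips); it assumes the scipy library is available:

# A minimal sketch of the coin example: exact binomial test of H0: P = 0.5.
from scipy.stats import binomtest

result = binomtest(k=40, n=50, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2e}")  # far below 0.05 -> reject H0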
4. Errors in Hypothesis Testing:
Two types of errors can result from a hypothesis test:
 Type I error. A Type I error occurs when the researcher rejects a null hypothesis that is true. The probability of committing a Type I error is called the significance level. This probability is also called alpha, and is often denoted by α.
 Type II error. A Type II error occurs when the researcher fails to reject a null hypothesis that is false. The probability of committing a Type II error is called beta, and is often denoted by β. The probability of not committing a Type II error is called the power of the test.
5. Decision Making Rules:
The analysis plan includes decision rules for rejecting the null hypothesis. In practice, statisticians describe these decision rules in two ways: with reference to a P-value or with reference to a region of acceptance.
 P-value: The strength of evidence in support of a null hypothesis is measured by the P-value. Suppose the test statistic is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
 Region of acceptance: The region of acceptance is a range of values. If the test statistic falls within the region of acceptance, the null hypothesis is not rejected. The region of acceptance is defined so that the chance of making a Type I error is equal to the significance level. The set of values outside the region of acceptance is called the region of rejection. If the test statistic falls within the region of rejection, the null hypothesis is rejected. In such cases, we say that the hypothesis has been rejected at the α level of significance.
These approaches are equivalent: some statistics texts use the P-value approach, others the region of acceptance approach.
6. One-Tailed and Two-Tailed Tests:
A test of a statistical hypothesis where the region of rejection is on only one side of the sampling distribution is called a one-tailed test. For example, suppose the null hypothesis states that the mean is less than or equal to 10. The alternative hypothesis would be that the mean is greater than 10. The region of rejection would consist of a range of numbers located on the right side of the sampling distribution, that is, a set of numbers greater than 10.
A test of a statistical hypothesis where the region of rejection is on both sides of the sampling distribution is called a two-tailed test. For example, suppose the null hypothesis states that the
mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater than 10. The region of rejection would consist of a range of numbers located on both sides of the sampling distribution: partly of numbers less than 10 and partly of numbers greater than 10.
7. Confidence Interval:
 A confidence interval expresses how much uncertainty is associated with a particular statistic. Confidence intervals are often used together with a margin of error.
 It states how confident one can be that the results from a poll or survey reflect what would be found if it were possible to survey the entire population.
 Confidence intervals are intrinsically connected to confidence levels.
 Confidence intervals consist of a range of potential values of the unknown population parameter. However, the interval computed from a particular sample does not necessarily include the true value of the parameter.
 Based on the (usually made) assumption that the observed data are a random sample from a true population, the confidence interval obtained from the data is itself random.
 The confidence level is designated prior to examining the data. Most commonly, the 95% confidence level is used; however, other confidence levels, such as 90% and 99%, can also be used.
 Factors affecting the width of the confidence interval include the size of the sample, the confidence level, and the variability in the sample.
 A larger sample will tend to produce a better estimate of the population parameter, all other factors being equal.
 A higher confidence level will tend to produce a broader confidence interval. (A computational sketch follows below.)
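To make this concrete, the following sketch (illustrative data, scipy assumed available) computes a 95% confidence interval for a mean using the t distribution, since the population standard deviation is unknown:

# A minimal sketch: 95% confidence interval for a mean.
import math
import statistics
from scipy.stats import t

sample = [9.8, 10.4, 10.1, 9.6, 10.7, 10.2, 9.9, 10.5]   # hypothetical data
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)    # standard error of the mean
t_crit = t.ppf(0.975, df=n - 1)                 # two-sided 95% critical value
print(f"95% CI: ({mean - t_crit * se:.2f}, {mean + t_crit * se:.2f})")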
Unit VI: Test of Significance
1. Steps in Testing Statistical Significance:
1. The first step is to specify the null hypothesis. For a two-tailed test, the null hypothesis is typically that a parameter equals zero, although there are exceptions. A typical null hypothesis is μ1 - μ2 = 0, which is equivalent to μ1 = μ2. For a one-tailed test, the null hypothesis is either that a parameter is greater than or equal to zero or that a parameter is less than or equal to zero. If the prediction is that μ1 is larger than μ2, then the null hypothesis (the reverse of the prediction) is μ2 - μ1 ≥ 0, which is equivalent to μ1 ≤ μ2.
2. The second step is to specify the α level, also known as the significance level. Typical values are 0.05 and 0.01.
3. The third step is to compute the probability value (also known as the p value). This is the probability of obtaining a sample statistic as different or more different from the parameter specified in the null hypothesis, given that the null hypothesis is true.
4. Finally, compare the probability value with the α level. If the probability value is lower, you reject the null hypothesis. Keep in mind that rejecting the null hypothesis is not an all-or-none decision: the lower the probability value, the more confidence you can have that the null hypothesis is false. However, if your probability value is higher than the conventional α level of 0.05, most scientists will consider your findings inconclusive. Failure to reject the null hypothesis does not constitute support for the null hypothesis; it just means you do not have sufficiently strong data to reject it.
2. Sampling Distribution of Mean and Standard Error:
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples of a given size from the same population. The sampling distribution depends on the underlying distribution of the population, the statistic being considered, the sampling procedure employed, and the sample size used. There is often considerable interest in whether the sampling distribution can be approximated by an asymptotic distribution, which corresponds to the limiting case either as the number of
random samples of finite size, taken from an infinite population and used to produce the distribution, tends to infinity, or when just one equally-infinite-size "sample" is taken of that same population.
2.1. Standard Error:
The standard error (SE) is very similar to the standard deviation: both are measures of spread, and the higher the number, the more spread out the data are. There is, however, one important difference: the standard error is computed from statistics (sample data), while standard deviations use parameters (population data).
In statistics, you will come across terms like "the standard error of the mean" or "the standard error of the median." The SE tells you how far your sample statistic (like the sample mean) is likely to deviate from the actual population mean. The larger your sample size, the smaller the SE; in other words, the larger your sample size, the closer your sample mean is to the actual population mean.
2.2. SE Calculation:
How you find the standard error depends on which statistic you need; for example, the calculation is different for a mean and for a proportion. When you are asked to find the sample error, you are probably being asked for the standard error of the mean, which uses the formula s/√n. You might also be asked to find standard errors for other statistics.
2.3. Standard Error Formula:
The following tables show how to find the standard deviation (first table) and the SE (second table). The first assumes you know the relevant population parameters; if you do not, you can find the standard error from the sample for:
 the sample mean;
 the sample proportion;
 the difference between means;
 the difference between proportions.
Parameter (Population)            Formula for Standard Deviation
Sample mean                       σ / √n
Sample proportion                 √[P(1-P) / n]
Difference between means          √[σ1²/n1 + σ2²/n2]
Difference between proportions    √[P1(1-P1)/n1 + P2(1-P2)/n2]

Statistic (Sample)                Formula for Standard Error
Sample mean                       s / √n
Sample proportion                 √[p(1-p) / n]
Difference between means          √[s1²/n1 + s2²/n2]
Difference between proportions    √[p1(1-p1)/n1 + p2(1-p2)/n2]
Key for the above tables:
P = proportion of successes (population).
p = proportion of successes (sample).
n = number of observations (sample).
n1 = number of observations (sample 1).
n2 = number of observations (sample 2).
σ1² = variance (sample 1).
σ2² = variance (sample 2).
2.4. Sampling Distribution of the Mean:
Definition: The sampling distribution of the mean is the distribution of the means of all possible samples of a given size drawn from the population. If the population distribution is normal, then the sampling distribution of the mean is normal for samples of all sizes.
Following are the main properties of the sampling distribution of the mean:
 Its mean is equal to the population mean: μx̄ = μ (x̄ = sample mean, μ = population mean).
 Its standard deviation is equal to the population standard deviation divided by the square root of the sample size: σx̄ = σ / √n (σ = population standard deviation, n = sample size).
 The sampling distribution of the mean is normally distributed for large samples. That is, the distribution of sample means for a large sample size is normally distributed irrespective of the shape of the population, provided the population standard deviation (σ) is finite. Generally, a sample size of 30 or more is considered large for statistical purposes. If the population is normal, then the distribution of sample means will be normal irrespective of the sample size.
σx̄ is a measure of the precision with which the sample mean can be used to estimate the true value of the population mean. σx̄ varies in direct proportion to the variation in the original population and inversely with the square root of the sample size n. Thus, the greater the variation in the original items of the population, the greater the variation expected in the sampling error when using x̄ as an estimate of μ. Note that the larger the sample size, the smaller the standard error, and vice versa (see the simulation sketch below).
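The following simulation sketch (arbitrary skewed population, numpy assumed available) illustrates these properties: the standard deviation of many sample means closely matches σ/√n even though the population itself is not normal:

# A minimal simulation sketch: the SD of many sample means approaches
# sigma / sqrt(n), the standard error of the mean.
import numpy as np

rng = np.random.default_rng(42)
sigma, n, n_samples = 2.0, 30, 10_000

# Draw 10,000 samples of size n from a skewed (exponential) population.
samples = rng.gamma(shape=1.0, scale=sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(f"SD of sample means: {sample_means.std():.3f}")
print(f"sigma / sqrt(n):    {sigma / np.sqrt(n):.3f}")  # close agreement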
3. Large Sample Tests:
 Some researchers choose to increase their sample size if they have an effect which is almost within the significance level. This is done when the researcher suspects that he is short of samples, rather than that there is no effect. We need to be careful using this method, as it increases the chance of producing a false positive result.
 With a larger sample size, the likelihood of encountering Type I and Type II errors is reduced, at least if the other parts of the study are carefully constructed and problems avoided.
 A higher sample size allows the researcher to increase the significance level of the findings, since the confidence in the result is likely to increase with sample size. This is to be expected, because the larger the sample size, the more accurately it is expected to mirror the behavior of the whole group.
 Therefore, if you want to reject your null hypothesis, you should make sure your sample size is at least equal to the sample size needed for the chosen statistical significance and expected effect size.
4. Z-Test:
 A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.
 Because of the central limit theorem, many test statistics are approximately normally distributed for large samples.
 For each significance level, the Z-test has a single critical value (for example, 1.96 for 5% two-tailed), which makes it more convenient than the Student's t-test, which has separate critical values for each sample size.
 Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known.
 If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student's t-test may be more appropriate.
 If T is a statistic that is approximately normally distributed under the null hypothesis, the next step in performing a Z-test is to estimate the expected value θ of T under the null hypothesis, and then obtain an estimate s of the standard deviation of T.
 After that, the standard score Z = (T − θ) / s is calculated, from which one-tailed and two-tailed p-values can be calculated as Φ(−Z) (for upper-tailed tests), Φ(Z) (for lower-tailed tests) and 2Φ(−|Z|) (for two-tailed tests), where Φ is the standard normal cumulative distribution function. (A numeric example follows below.)
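For example, the following sketch performs a two-tailed one-sample Z-test with hypothetical numbers, using the 2Φ(−|Z|) formula above (scipy assumed available):

# A minimal sketch of a two-tailed one-sample Z-test:
# H0: population mean = 100, known population SD = 15, sample of n = 50.
import math
from scipy.stats import norm

mu0, sigma, n, sample_mean = 100, 15, 50, 105.2   # hypothetical values
z = (sample_mean - mu0) / (sigma / math.sqrt(n))  # standard score
p_two_tailed = 2 * norm.cdf(-abs(z))              # 2 * Phi(-|Z|)
print(f"Z = {z:.2f}, p = {p_two_tailed:.4f}")     # p < 0.05 -> reject H0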
5. T-Test:
 When the difference between two population averages is being investigated, a t test is used. In other words, a t test is used when we wish to compare two means (the scores must be measured on an interval or ratio measurement scale). For example, we would use a t test if we wished to compare the reading achievement of boys and girls.
 With a t test, we have one independent variable and one dependent variable. The independent variable (gender in this case) can only have two levels (male and female). The dependent variable would be reading achievement. If the independent variable had more than two levels, we would use a one-way analysis of variance (ANOVA).
 The test statistic that a t test produces is a t-value. Conceptually, t-values are an extension of z-scores. In a way, the t-value represents how many standard units apart the means of the two groups are.
 With a t test, the researcher wants to state with some degree of confidence that the obtained difference between the means of the sample groups is too great to be a chance event and that some difference also exists in the population from which the sample was drawn.
 In other words, the difference that we might find between the boys' and girls' reading achievement in our sample might have occurred by chance, or it might exist in the population.
 If our t test produces a t-value that results in a probability of .01, we say that the likelihood of getting the difference we found by chance would be 1 in 100.
 We could say that it is unlikely that our results occurred by chance, and the difference we found in the sample probably exists in the populations from which it was drawn.
5.1. Paired and Unpaired T-test:
 T-tests are useful for comparing the means of two samples. There are two types: paired and unpaired.
 Paired means that both samples consist of the same test subjects. A paired t-test is equivalent to a one-sample t-test on the within-subject differences.
 Unpaired means that the two samples consist of distinct test subjects. An unpaired t-test is equivalent to a two-sample t-test.
 For example, if you wanted to conduct an experiment to see how drinking an energy drink increases heart rate, you could do it two ways.
 The "paired" way would be to measure the heart rate of 10 people before they drink the energy drink and then measure the heart rate of the same 10 people after drinking it. These two samples consist of the same test subjects, so you would perform a paired t-test on the means of both samples.
 The "unpaired" way would be to measure the heart rate of 10 people before drinking an energy drink and then measure the heart rate of some other group of people who have drunk energy drinks. These two samples consist of different test subjects, so you would perform an unpaired t-test on the means of both samples. (Both variants are sketched in code below.)
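Both variants can be run in a few lines; the sketch below uses hypothetical heart-rate data and assumes scipy is available:

# A minimal sketch of both t-test variants with hypothetical heart rates.
from scipy import stats

before = [72, 68, 75, 70, 74, 69, 71, 73, 67, 76]   # same 10 subjects
after = [78, 74, 80, 75, 79, 73, 77, 78, 72, 81]

# Paired: same subjects measured twice.
t_paired = stats.ttest_rel(before, after)
print(f"paired:   t = {t_paired.statistic:.2f}, p = {t_paired.pvalue:.4f}")

# Unpaired: two independent groups (the lists are reused here for illustration).
t_unpaired = stats.ttest_ind(before, after)
print(f"unpaired: t = {t_unpaired.statistic:.2f}, p = {t_unpaired.pvalue:.4f}")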
6. Parametric and Nonparametric Tests:
6.1. Definition of Parametric Test: The parametric test is a hypothesis test which provides generalizations for making statements about the mean of the parent population. A t-test based on Student's t-statistic is often used in this regard. The t-statistic rests on the underlying assumption that the variable is normally distributed and the mean is known or assumed to be known. The population variance is estimated from the sample. It is assumed that the variables of interest in the population are measured on an interval scale.
6.2. Definition of Nonparametric Test: The nonparametric test is defined as a hypothesis test which is not based on underlying distributional assumptions, i.e. it does not require the population's distribution to be characterized by specific parameters. The test is mainly based on differences in medians; hence it is alternately known as a distribution-free test. The test assumes that the variables are measured on a nominal or ordinal level. It is used when the independent variables are non-metric. For example, the Mann–Whitney U test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.
Key Differences between Parametric and Nonparametric Tests:
The fundamental differences between parametric and nonparametric tests are discussed in the following points:
1. A statistical test in which specific assumptions are made about the population parameters is known as a parametric test. A statistical test used in the case of non-metric independent variables is called a nonparametric test.
2. In the parametric test, the test statistic is based on a distribution. In the nonparametric test, the test statistic is arbitrary.
3. In the parametric test, it is assumed that the variables of interest are measured on an interval or ratio level, whereas in the nonparametric test the variables of interest are measured on a nominal or ordinal scale.
4. In general, the measure of central tendency in the parametric test is the mean, while in the nonparametric test it is the median.
5. In the parametric test, there is complete information about the population. Conversely, in the nonparametric test, there is no information about the population.
6. The applicability of the parametric test is for variables only, whereas the nonparametric test applies to both variables and attributes.
7. For measuring the degree of association between two quantitative variables, Pearson's coefficient of correlation is used in the parametric test, while Spearman's rank correlation is used in the nonparametric test.
7. Chi Square Test:
 A chi-squared test, also written as χ² test, is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Without other qualification, 'chi-squared test' is often used as shorthand for Pearson's chi-squared test.
 The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
 In the standard applications of this test, the observations are classified into mutually exclusive classes, and there is some theory, or null hypothesis, which gives the probability that any observation falls into the corresponding class.
 The purpose of the test is to evaluate how likely the observations that are made would be, assuming the null hypothesis is true.
 Chi-squared tests are often constructed from a sum of squared errors, or through the sample variance.
 Test statistics that follow a chi-squared distribution arise from an assumption of independent normally distributed data, which is valid in many cases due to the central limit theorem.
 A chi-squared test can be used to attempt rejection of the null hypothesis that the data are independent. (A small worked example follows below.)
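As a small worked example, the following sketch (hypothetical 2×2 contingency table, scipy assumed available) applies Pearson's chi-squared test of independence:

# A minimal sketch of a chi-squared test of independence on a hypothetical
# 2x2 contingency table (rows: gender, columns: preference).
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 25]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")  # small p -> reject independence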
Unit VII: Experimental Designs
1. Principles of Experimental Design:
The basic principles of experimental design are (i) randomization, (ii) replication and (iii) local control.
1.1. Randomization:
Randomization is the cornerstone underlying the use of statistical methods in experimental design. Randomization is the random process of assigning treatments to the experimental units. The random process implies that every possible allotment of treatments has the same probability. For example, if the number of treatments is t = 3 (say A, B, and C) and the number of replications is r = 4, then the number of experimental units is n = t × r = 3 × 4 = 12. Replication means that each treatment will appear 4 times, as r = 4. Let the design be:
A C B C C B A B A C B A
Note from the design that units 1, 7, 9 and 12 are reserved for treatment A, units 3, 6, 8 and 11 for treatment B, and units 2, 4, 5 and 10 for treatment C. P(A) = 4/12, P(B) = 4/12, and P(C) = 4/12, meaning that treatments A, B, and C have equal chances of selection. (A small randomization sketch follows below.)
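The following minimal sketch (illustrative only) performs such a random allotment of 3 treatments × 4 replications to 12 units, giving every arrangement the same probability:

# A minimal sketch of randomly allotting treatments to experimental units.
import random

random.seed(1)
treatments = ["A", "B", "C"] * 4   # t = 3 treatments, r = 4 replications
random.shuffle(treatments)         # random assignment to units 1..12
for unit, trt in enumerate(treatments, start=1):
    print(f"unit {unit:2d} -> treatment {trt}")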
1.2. Replication:
The second principle of experimental design is replication, which is a repetition of the basic experiment. In other words, it is a complete run of all the treatments to be tested in the experiment. In all experiments, some variation is introduced by the fact that the experimental units, such as individuals or plots of land in agricultural experiments, cannot be physically identical. This type of variation can be reduced by using a number of experimental units; we therefore perform the experiment more than once, i.e., we repeat the basic experiment. An individual repetition is called a replicate. The number, the shape and the size of replicates depend upon the nature of the experimental material. Replication is used to:
(i) secure a more accurate estimate of the experimental error, a term which represents the differences that would be observed if the same treatments were applied several times to the same experimental units; and
(ii) decrease the experimental error and thereby increase precision, which is a measure of the variability of the experimental error.
1.3. Local Control:
It has been observed that randomization and replication do not remove all extraneous sources of variation, i.e., they are unable to control the extraneous sources of variation. Thus we need a refinement of the experimental technique. In other words, we need to choose a design in such a way that all extraneous sources of variation are brought under control. For this purpose we make use of local control, a term referring to (i) balancing, (ii) blocking and (iii) grouping of experimental units.
Balancing: Balancing means that the treatments should be assigned to the experimental units in such a way that the result is a balanced arrangement of treatments.
Blocking: Blocking means that like experimental units should be collected together to form relatively homogeneous groups. A block is also a replicate.
The main purpose of local control is to increase the efficiency of the experimental design by decreasing the experimental error.
2. Longitudinal Study:
 A longitudinal study (or longitudinal survey, or panel study) is a research design that involves repeated observations of the same variables (e.g., people) over short or long periods of time (i.e., it uses longitudinal data).
 It is often a type of observational study, although longitudinal studies can also be structured as randomized experiments.
 Longitudinal studies are often used in social-personality and clinical psychology to study rapid fluctuations in behaviors, thoughts, and emotions from moment to moment or day to day, and in developmental psychology to study developmental trends across the life span.
 Longitudinal studies can be retrospective (looking back in time, using existing data such as medical records or a claims database) or prospective (requiring the collection of new data).
3. Cross Sectional Study:
 In medical research and social science, a cross-sectional study (also known as a cross-sectional analysis, transverse study, or prevalence study) is a type of observational study that analyzes data from a population, or a representative subset, at a specific point in time, that is, cross-sectional data.
 In medical research, cross-sectional studies differ from case-control studies in that they aim to provide data on the entire population under study, whereas case-control studies typically include only individuals with a specific characteristic, together with a sample, often a tiny minority, of the rest of the population.
 Cross-sectional studies are descriptive studies (neither longitudinal nor experimental).
 The study may be used to describe some feature of the population, such as the prevalence of an illness, or it may support inferences of cause and effect.
 Longitudinal studies differ from both in making a series of observations, more than once, on members of the study population over a period of time.
4. Prospective and Retrospective Study:
4.1. Prospective study:
 It is an epidemiologic study in which groups of individuals (cohorts) are selected on the basis of factors that are to be examined for possible effects on some outcome. For example, the effect of exposure to a specific risk factor on the eventual development of a particular disease can be studied.
 The cohorts are then followed over a period of time to determine the incidence rates of the outcomes being studied as they relate to the original factors. It is also called a cohort study.
 The term prospective usually implies a cohort selected in the present and followed into the future, but this method can also be applied to existing longitudinal historical data, such as insurance or medical records.
 A cohort is identified and classified as to exposure to the risk factor at some date in the past and followed up to the present to determine incidence rates. This is called a historical prospective study, a prospective study of past data, or a retrospective cohort study.
4.2. Retrospective study:
 It is an epidemiologic study in which participating individuals are classified as either having some outcome (cases) or lacking it (controls).
 The outcome may be a specific disease, and the persons' histories are examined for specific factors that might be associated with that outcome.
 Cases and controls are often matched with respect to certain demographic or other variables, but need not be.
 As compared to prospective studies, retrospective studies suffer from drawbacks: certain important statistics cannot be measured, and large biases may be introduced both in the selection of controls and in the recall of past exposure to risk factors.
 The advantages of the retrospective study are its small scale, usually short time to completion, and its applicability to rare diseases, which would require the study of very large cohorts in prospective studies.
5. Randomized Block:
 The blocks method was introduced by S. Bernstein.
 In the statistical theory of the design of experiments, blocking is the arranging of experimental units in groups (blocks) that are similar to one another.
 Typically, a blocking factor is a source of variability that is not of primary interest to the experimenter.
 In probability theory, the blocks method consists of splitting a sample into blocks (groups) separated by smaller sub-blocks so that the blocks can be considered almost independent.
 The blocks method helps in proving limit theorems in the case of dependent random variables.
Example:
Gender    Treatment
          Placebo    Vaccine
Male      250        250
Female    250        250
Subjects are assigned to blocks based on gender. Then, within each block, subjects are randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine. (A small sketch of this blocked randomization follows below.)
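The following sketch (hypothetical subject labels) carries out this blocked randomization: within each gender block, 500 subjects are randomly split between placebo and vaccine:

# A minimal sketch of a randomized block design.
import random

random.seed(7)
assignment = {}
for block in ("male", "female"):
    subjects = [f"{block}_{i}" for i in range(500)]
    random.shuffle(subjects)   # randomize within the block only
    assignment[block] = {"placebo": subjects[:250], "vaccine": subjects[250:]}

for block, groups in assignment.items():
    print(block, {g: len(s) for g, s in groups.items()})  # 250 per cell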
6. Simple Factorial Design:
 Factorial design is one of the many experimental designs used in psychological experiments, in which two or more independent variables are simultaneously manipulated to observe their effects on the dependent variable.
 A simple factorial design is an experimental design in which two or more levels of each independent variable are observed in combination with two or more levels of the other independent variables.
Example: A university wants to assess the starting salaries of their MBA graduates. The study looks at graduates working in four different employment areas: accounting, management, finance, and marketing. In addition to looking at the employment sector, the researchers also look at gender. In this example, the employment sector and gender of the graduates are the independent variables, and the starting salaries are the dependent variable. This would be considered a 4×2 factorial design.
7. Analysis of Variance (ANOVA):
 Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. It may seem odd that the technique is called "Analysis of Variance" rather than "Analysis of Means." The name is appropriate because inferences about means are made by analyzing variance.
 ANOVA is used to test general rather than specific differences among means.
 An ANOVA conducted on a design in which there is only one factor is called a one-way ANOVA.
 If an experiment has two factors, the ANOVA is called a two-way ANOVA.
Example: Suppose an experiment on the effects of age and gender on reading speed were conducted using three age groups (8 years, 10 years, and 12 years) and the two genders (male and female). The factors would be age and gender. Age would have three levels and gender would have two levels. (A one-way example in code follows below.)
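The one-way case can be illustrated with scipy's F-test; the reading speeds below (hypothetical words per minute) are made up for demonstration:

# A minimal sketch of a one-way ANOVA across three age groups.
from scipy.stats import f_oneway

age_8 = [80, 85, 78, 90, 82]
age_10 = [95, 100, 92, 98, 97]
age_12 = [110, 108, 115, 112, 109]

f_stat, p_value = f_oneway(age_8, age_10, age_12)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")  # small p -> the means differ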
8. Analysis of RBD:
 A Reliability Block Diagram (RBD) is a graphical representation of the components of a system and how they are related reliability-wise.
 The diagram represents the functioning state (i.e., success or failure) of the system in terms of the functioning states of its components.
 For example, a simple series configuration indicates that all of the components must operate for the system to operate; a simple parallel configuration indicates that at least one of the components must operate; and so on.
 When we define the reliability characteristics of each component, we can use software to calculate the reliability function for the entire system and obtain a wide variety of system reliability analysis results, including the ability to identify critical components and calculate the optimum reliability allocation strategy to meet a system reliability goal. (The sketch below illustrates the series and parallel rules.)
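The series and parallel rules mentioned above can be computed directly; the component reliabilities in this sketch are hypothetical:

# A minimal sketch of the series/parallel reliability rules: a series system
# works only if all components work; a parallel system fails only if all fail.
from math import prod

reliabilities = [0.95, 0.90, 0.99]   # hypothetical component reliabilities

r_series = prod(reliabilities)                        # all must work
r_parallel = 1 - prod(1 - r for r in reliabilities)   # at least one works
print(f"series:   {r_series:.4f}")    # ~0.8465
print(f"parallel: {r_parallel:.4f}")  # ~0.99995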
9. Meta-analysis:
 A meta-analysis is a statistical analysis that combines the results of multiple scientific studies.
 A meta-analysis can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error.
 The aim is then to use statistical approaches to derive a pooled estimate closest to the unknown common truth, based on how this error is perceived.
 Existing methods for meta-analysis yield a weighted average of the results of the individual studies; what differs is the manner in which these weights are allocated and the manner in which the uncertainty around the resulting point estimate is computed.
 In addition to providing an estimate of the unknown common truth, meta-analysis has the capacity to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies.
10. Systematic Review:
 Systematic reviews are a type of literature review that uses systematic methods to collect secondary data, critically appraise research studies, and synthesize findings qualitatively or quantitatively.
 Systematic reviews formulate research questions that are broad or narrow in scope, and identify and synthesize studies that directly relate to the systematic review question.
 They are designed to provide a complete, exhaustive summary of current evidence relevant to a research question.
 For example, systematic reviews of randomized controlled trials are key to the practice of evidence-based medicine, and a review of existing studies is often quicker and cheaper than embarking on a new study.
 While systematic reviews are often applied in the biomedical or healthcare context, they can be used in other areas where an assessment of a precisely defined subject would be helpful.
 Systematic reviews may examine clinical tests, public health interventions, environmental interventions, social interventions, adverse effects, and economic evaluations.
11. Ethics in Statistics:
Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations. In some situations, guideline principles may conflict, requiring individuals to prioritize principles according to context. However, in all cases, stakeholders have an obligation to act in good faith, to act in a manner that is consistent with these guidelines, and to encourage others to do the same. Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical.
Ethical statistical practice does not include, promote, or tolerate any type of professional or scientific misconduct, including, but not limited to, bullying, sexual or other harassment, discrimination based on personal characteristics, or other forms of intimidation.
A. Professional Integrity and Accountability: The ethical statistician uses methodology and data that are relevant and appropriate, without favoritism or prejudice, and in a manner intended to produce valid, interpretable, and reproducible results. The ethical statistician does not knowingly accept work for which he/she is not sufficiently qualified, is honest with the client about any limitations of expertise, and consults other statisticians when necessary or in doubt. It is essential that statisticians treat others with respect. The ethical statistician:
1. Identifies and mitigates any preferences on the part of the investigators or data providers that might predetermine or influence the analyses/results.
2. Employs selection or sampling methods and analytic approaches appropriate and valid for the specific question to be addressed, so that results extend beyond the sample to a population relevant to the objectives with minimal error under reasonable assumptions.
3. Respects and acknowledges the contributions and intellectual property of others.
4. When establishing authorship order for posters, papers, and other scholarship, strives to make clear the basis for this order, if determined on grounds other than intellectual contribution.
5. Discloses conflicts of interest, financial and otherwise, and manages or resolves them according to established (institutional/regional/local) rules and laws.
6. Accepts full responsibility for his/her professional performance. Provides only expert testimony, written work, and oral presentations that he/she would be willing to have peer reviewed.
7. Exhibits respect for others and, thus, neither engages in nor condones discrimination based on personal characteristics; bullying; unwelcome physical, including sexual, contact; or other forms of harassment or intimidation, and takes appropriate action when aware of such unethical practices by others.
B. Integrity of Data and Methods: The ethical statistician is candid about any known or suspected limitations, defects, or biases in the data that may affect the integrity or reliability of
the statistical analysis. Objective and valid interpretation of the results requires that the underlying analysis recognizes and acknowledges the degree of reliability and integrity of the data. The ethical statistician:
1. Acknowledges statistical and substantive assumptions made in the execution and interpretation of any analysis. When reporting on the validity of the data used, acknowledges data editing procedures, including any imputation and missing-data mechanisms.
2. Reports the limitations of statistical inference and possible sources of error.
3. In publications, reports, or testimony, identifies who is responsible for the statistical work if it would not otherwise be apparent.
4. Reports the sources and assessed adequacy of the data, accounts for all data considered in a study, and explains the sample(s) actually used.
5. Clearly and fully reports the steps taken to preserve data integrity and valid results.
6. Where appropriate, addresses potential confounding variables not included in the study.
7. In publications and reports, conveys the findings in ways that are both honest and meaningful to the user/reader. This includes tables, models, and graphics.
8. In publications or testimony, identifies the ultimate financial sponsor of the study, the stated purpose, and the intended use of the study results.
9. When reporting analyses of volunteer data or other data that may not be representative of a defined population, includes appropriate disclaimers and, if used, appropriate weighting.
10. To aid peer review and replication, shares the data used in the analyses whenever possible/allowable, and exercises due caution to protect proprietary and confidential data, including all data that might inappropriately reveal respondent identities.
11. Strives to promptly correct any errors discovered while producing the final report or after publication. As appropriate, disseminates the correction publicly or to others relying on the results.
C. Responsibilities to Science/Public/Funder/Client: The ethical statistician supports valid inferences, transparency, and good science in general, keeping the interests of the public, funder, client, or customer in mind (as well as professional colleagues, patients, the public, and the scientific community). The ethical statistician:
1. To the extent possible, presents a client or employer with choices among valid alternative statistical approaches that may vary in scope, cost, or precision.
2. Strives to explain any expected adverse consequences of failure to follow through on an agreed-upon sampling or analytic plan.
3. Applies statistical sampling and analysis procedures scientifically, without predetermining the outcome.
4. Strives to make new statistical knowledge widely available to provide benefits to society at large and beyond his/her own scope of applications.
5. Understands and conforms to confidentiality requirements of data collection, release, and dissemination and any restrictions on its use established by the data provider (to the extent legally required), protecting use and disclosure of data accordingly. Guards privileged information of the employer, client, or funder.
D. Responsibilities to Research Subjects: The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research. The ethical statistician: