DESCRIBING DATA
ANALYSIS II
FINDING PROPORTIONS:
WHAT IS A PROPORTION?
A proportion is a way of expressing the relationship
between a part and a whole. It tells you how much one
quantity compares to another.
For example, if you have a group of 20 people, and 15 of
them have brown hair, the proportion of people with
brown hair is 15/20, or 3/4. This means that 3 out of
every 4 people in the group have brown hair.
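The brown-hair example can be checked with a few lines of Python. This is only a small sketch; the helper name `proportion` is made up for illustration.

```python
from fractions import Fraction

def proportion(part, whole):
    """Return part/whole as an exact fraction in lowest terms."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return Fraction(part, whole)

# 15 of the 20 people in the group have brown hair.
brown_hair = proportion(15, 20)
print(brown_hair)         # 3/4
print(float(brown_hair))  # 0.75
```

Using `Fraction` keeps the result exact and reduces 15/20 to 3/4 automatically.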
WHAT IS A PROBABILITY
DISTRIBUTION?
• A probability distribution is a mathematical function that
describes the likelihood of different outcomes in a sample
space. It assigns a probability to each possible
outcome, and the total probability across all outcomes is
equal to 1. Probability distributions can take various
forms, and one specific type is the normal distribution.
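As a rough illustration, a discrete probability distribution can be written as a table of outcomes and probabilities. The fair six-sided die below is an assumed example (it is not from the slides), and `is_valid_distribution` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def is_valid_distribution(dist):
    """Check that no probability is negative and that they sum to 1."""
    return all(p >= 0 for p in dist.values()) and sum(dist.values()) == 1

print(is_valid_distribution(die))  # True

# Probability of an event = sum of the probabilities of its outcomes.
p_even = sum(p for face, p in die.items() if face % 2 == 0)
print(p_even)  # 1/2
```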
WHAT IS A NORMAL DISTRIBUTION?
• The normal distribution is a specific type of probability distribution
characterized by a symmetric bell-shaped curve. It is
completely defined by its mean and standard deviation.
The normal distribution is widely used in statistics and
probability theory due to its mathematical properties
and its frequent occurrence in natural phenomena.
AREA UNDER A PROBABILITY
DISTRIBUTION
• The "area under a probability distribution" over a range of
values is the probability of observing a value within that
range. The area under the whole curve is the total
probability, which is 1.
TYPES OF DISTRIBUTION
NORMAL DISTRIBUTION
Since the normal distribution is a probability distribution
and since areas under a probability distribution represent
probabilities, the total area under a normal distribution must
be 1.
In general, areas under the normal distribution represent
proportions of a population.
For example, about 95% of the population lies within 2
standard deviations of the mean in a normal distribution, so
the area under a normal distribution between µ − 2σ and µ +
2σ is 0.95.
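The area within k standard deviations of the mean can be computed from the error function in Python's standard library. This is a minimal sketch; `area_within` is an illustrative name. It shows that the "about 95%" figure is a rounding of roughly 0.9545.

```python
import math

def area_within(k):
    """Area under any normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))

print(round(area_within(1), 4))  # 0.6827
print(round(area_within(2), 4))  # 0.9545
print(round(area_within(3), 4))  # 0.9973
```

The result does not depend on the particular values of µ and σ, only on how many standard deviations wide the interval is.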
The cumulative proportion for a value x in a distribution is
the proportion of observations in the distribution that lie at
or below x.
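The sketch below computes a cumulative proportion in two ways: by counting in a small made-up list of scores, and from a normal curve using `statistics.NormalDist` (Python 3.8+). The scores, the mean of 100, and the standard deviation of 15 are assumptions chosen only for illustration.

```python
from statistics import NormalDist

def cumulative_proportion(values, x):
    """Proportion of observations that lie at or below x."""
    return sum(v <= x for v in values) / len(values)

# Made-up scores for illustration.
scores = [55, 60, 65, 70, 70, 75, 80, 85, 90, 95]
print(cumulative_proportion(scores, 70))  # 0.5

# For a normal distribution, the cumulative proportion is the area
# under the curve to the left of x (assumed mean 100, sd 15).
dist = NormalDist(mu=100, sigma=15)
print(dist.cdf(100))            # 0.5
print(round(dist.cdf(130), 4))  # 0.9772
```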
FINDING THE Z-SCORE IN A NORMAL
CURVE
z = (X - μ) / σ
where X is a value of a normal random variable, μ is the mean of X,
and σ is the standard deviation of X. The z-score tells you how
many standard deviations X lies above or below the mean.
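The z-score formula translates directly into code. This is a small sketch; the mean of 100 and standard deviation of 15 are assumed values, not taken from the slides.

```python
def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Assumed distribution: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0
print(z_score(85, 100, 15))   # -1.0
```

A positive z-score means the value is above the mean; a negative one means it is below.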
CORRELATION
WHAT IS CORRELATION
Correlation is a statistical measure that expresses the extent to
which two variables are linearly related (meaning they change
together at a constant rate).
CORRELATION
• The value of the correlation coefficient lies between -1 and +1.
SCATTER PLOT:
• It is a visual method for assessing linear correlation.
• It is the simplest method.
• The first variable is independent and is plotted along the x axis, and
the second variable is dependent on the first and is plotted
along the y axis.
• The diagram helps in determining how closely the two variables are
related.
CORRELATION COEFFICIENT:
• Correlation coefficients are used to measure how strong a
connection is between two variables. There are different
types of correlation coefficients; one of the most popular
is Pearson's correlation (also known as Pearson's r), which
is commonly used in linear regression.
• Linear regression is a supervised learning algorithm that
models the relationship between input (X) and output (Y)
variables based on labeled data. It is used for finding the
relationship between the two variables and predicting future
results based on past relationships.
How to Find the Correlation Coefficient
• Correlation is used almost everywhere in statistics.
Correlation describes the relationship between two or
more variables. It is expressed as a number
known as the correlation coefficient.
There are mainly two types of correlation:
• Positive Correlation
• Negative Correlation
PROPERTIES OF CORRELATION
COEFFICIENT
• The correlation coefficient is a pure number: it has no units and does not
depend on the units in which the two variables are measured.
• The sign of the correlation coefficient is always the same as the sign of the
covariance of the two variables.
• The numerical value of the correlation coefficient is a real number
between -1 and +1.
• A negative value of the coefficient indicates a negative relationship. The
closer 'r' gets to -1, the stronger the negative relationship.
• A positive value of the coefficient indicates a positive relationship. The
closer 'r' gets to +1, the stronger the positive relationship; r = +1 is a
perfect positive linear relationship.
• A weak correlation is signalled when the correlation coefficient approaches
zero. When 'r' is near zero, the linear relationship is weak.
Correlation coefficient Formula
• The correlation coefficient formula is used to determine how
strong the relationship is between two sets of data. It
yields a value between -1 and 1, where:
• -1 indicates a perfect negative relationship
• 1 indicates a perfect positive relationship
• 0 indicates no linear relationship at all
• A correlation coefficient of -1 means that for every positive
increase in one variable, there is a decrease of a fixed
proportion in the other. For example, the amount of gas in a
tank decreases in (almost) perfect correlation with speed.
• A correlation coefficient of 1 means that for every positive
increase in one variable, there is a positive increase of a
fixed proportion in the other. For example, shoe size goes up
in (almost) perfect correlation with foot length.
• Zero means that an increase in one variable is associated with
neither an increase nor a decrease in the other. The two are
not linearly related.
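The slides do not show the formula itself. A common form of Pearson's r, given here as an assumption about which formula is meant, is r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²). The sketch below implements it with made-up data; `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # 1.0  (perfect positive)
print(pearson_r(x, [10, 8, 6, 4, 2]))  # -1.0 (perfect negative)
print(pearson_r(x, [1, 3, 2, 3, 1]))   # 0.0  (no linear relationship)
```

The function divides by zero if either variable is constant, since the correlation is undefined in that case.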
OUTLIERS
Outliers are very extreme scores that require
special attention because of their potential impact on a
summary of data.
This is also true when outliers appear among sets of paired
scores, where a single outlier can change the correlation
coefficient substantially.
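To see how much one outlier can matter for paired scores, the sketch below compares Pearson's r with and without a single extreme pair. The data are made up for illustration, and `pearson_r` is an illustrative helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Five made-up pairs that lie exactly on a line.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_without = pearson_r(x, y)

# The same data plus one outlying pair (6, 0).
r_with = pearson_r(x + [6], y + [0])

print(round(r_without, 3))  # 1.0
print(round(r_with, 3))     # 0.143
```

A single extreme pair is enough here to turn a perfect correlation into a weak one, which is why outliers among paired scores deserve a closer look before r is reported.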
