fundamentals of data science and analytics on descriptive analysis.pptx

Histogram
• A histogram is a special kind of bar graph that applies to
quantitative data (discrete or continuous). The horizontal axis
represents the range of data values. The height of each bar
represents the frequency of data values falling within the
interval formed by the width of the bar. The bars are
pushed together with no spaces between them.
• More formally, a histogram is a diagram consisting of rectangles whose area is
proportional to the frequency of a variable and whose width
is equal to the class interval.
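As a minimal sketch of the idea above, the following counts how many values fall into each class interval and prints a text histogram. The scores, the starting value, and the class width are made-up assumptions for illustration, and `histogram_counts` is a hypothetical helper name.

```python
from collections import Counter


def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each class interval [lo, lo + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)
        # A value exactly on the upper edge of the last interval goes in the last bin.
        if index == n_bins:
            index = n_bins - 1
        if 0 <= index < n_bins:
            counts[index] += 1
    return [counts[i] for i in range(n_bins)]


# Hypothetical exam scores, made up for illustration.
scores = [52, 55, 61, 64, 67, 68, 71, 73, 75, 78, 79, 82, 85, 88, 94]
freqs = histogram_counts(scores, start=50, width=10, n_bins=5)

# Each bar's length is the frequency of its class interval.
for i, f in enumerate(freqs):
    lo = 50 + 10 * i
    print(f"{lo}-{lo + 10}: {'#' * f}")
```

Because every class interval has the same width here, bar height and bar area are both proportional to frequency.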

Frequency polygon
• Frequency polygons are a graphical device for understanding
the shape of a distribution.
• Frequency polygons serve the same purpose as histograms,
but they are especially helpful for comparing sets of data.
• Frequency polygons are also a good choice for displaying
cumulative frequency distributions.
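A frequency polygon can be sketched as a list of points to join with straight lines. The example below assumes the common convention of plotting each frequency at its class midpoint, and each cumulative frequency at the upper class bound. The class intervals and frequencies are made up for illustration.

```python
from itertools import accumulate

# Hypothetical class intervals (lower bound, upper bound) and their frequencies.
classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [2, 4, 5, 3, 1]

# A frequency polygon joins the points (class midpoint, frequency).
midpoints = [(lo + hi) / 2 for lo, hi in classes]
polygon_points = list(zip(midpoints, freqs))

# A cumulative frequency polygon plots running totals against upper class bounds.
cumulative = list(accumulate(freqs))
cumulative_points = [(hi, total) for (_, hi), total in zip(classes, cumulative)]

print(polygon_points)
print(cumulative_points)
```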

• Variability is the extent to which data
points in a statistical distribution or data set diverge
from the average value, as well as the extent to which these
data points differ from each other.
• Central tendency describes the central point of the
distribution, and variability describes how the scores are scattered
around that central point.
• Variability can be measured with the range, the
interquartile range, and the standard deviation or variance.

Range
• The range is the total distance covered by the distribution,
from the highest score to the lowest score
• Range = maximum value – minimum value

Merits:-
• It is easy to compute.
• It can be used as a measure of variability where precision is
not required.
Demerits:-
• Its value depends on only two scores.
• It ignores how the remaining scores are distributed.

Variance
• Variance is the expected value of the squared deviation of a
random variable from its mean.
• Variance is used in statistics to describe how widely the values of a data set
are spread around the mean.

Standard Deviation
• Standard deviation is simply the square root of the variance.
It measures the typical distance
between a score and the mean.

The Interquartile Range
• The interquartile range is the distance covered by the middle
50% of the distribution.
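The four measures of variability can be computed with Python's standard `statistics` module, as in this rough sketch. The data values are made up. The population forms (`pvariance`, `pstdev`) are used as an assumption, since the slides define variance as an expected value, and the quartiles use the "inclusive" method, which is only one of several conventions.

```python
import statistics

# Hypothetical scores, made up for illustration.
data = [4, 8, 15, 16, 23, 42]

# Range = maximum value - minimum value.
data_range = max(data) - min(data)

# Population variance: mean of the squared deviations from the mean.
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(data)

# Interquartile range: distance between the first and third quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, variance, std_dev, iqr)
```

The range uses only the two extreme scores, whereas the variance and standard deviation use every score and the interquartile range uses only the middle half.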

Variability for qualitative and Ranked Data
• Data is a collection of facts and figures that relay something
specific but are not organized in any way.
• A data set is a collection of related records or information. The
information may be about some entity or some subject area.
• A data set is a collection of data objects and their attributes. Attributes
capture the basic characteristics of an object.
• Each row of a data set is called a record. Each data set also
has multiple attributes, each of which gives information on a
specific characteristic.

Qualitative And Quantitative data
• Qualitative data provides information about the quality
of an object, or information that cannot be measured. Qualitative
data cannot be expressed as a number.
• Qualitative data is concerned with descriptions that can be
observed but cannot be computed.
• Qualitative data is also known as categorical data.
• Qualitative data can be further subdivided into two types:
• 1) Nominal data
• 2) Ordinal data

Scales of Measurement
• Data
• Qualitative: numerical or nonnumerical; nominal or ordinal
• Quantitative: numerical; interval or ratio

Quantitative data
• Quantitative data focuses on numbers and
mathematical calculations; its values can be
computed.
• Quantitative data can be expressed as a number or
quantified.
• There are two types of quantitative data:
• 1) Interval data
• 2) Ratio data

Difference between Quantitative data and Qualitative
data
Qualitative data | Quantitative data
Provides information about the quality of an object, or information that cannot be measured | Relates to information about the quantity of an object, hence it can be measured
Types: nominal data and ordinal data | Types: interval data and ratio data
Descriptive rather than numerical in nature | Expressed in numerical form
Example: The team is well prepared. | Example: The team has 7 players.

Advantages and Disadvantages of Qualitative
data
• Advantages:-
• It helps with in-depth analysis
• It avoids prejudgements
• Disadvantages:-
• Time consuming
• Not easy to generalize

Advantages and Disadvantages of Quantitative
data
• Advantages :-
• Easier to summarize and make comparisons
• It is often easier to obtain large sample size
• Disadvantages:-
• The cost is relatively high
• There is no accurate generalization of the data

Ranked data
• Ranked data is a variable whose values are taken
from an ordered set and recorded in order of magnitude
• Ordinal refers to “order”. Ordinal data is a type of
qualitative (categorical) data
• Characteristics of ranked data:-
• 1) Ordinal data shows the relative ranking of the variables
• 2) The interval properties are not known
• 3) It identifies and describes the magnitude of a variable
• Examples :-
• a) Level of agreement :- yes, no
• b) Time of day :- Morning, Afternoon, Evening, Night

Scale of measurement
• Scales of measurement are also known as levels of
measurement. Each scale has
specific properties that determine which
statistical analyses are appropriate
• There are four scales of measurement
• Nominal
• Ordinal
• Interval
• Ratio

Nominal
• Data are labels or names used to identify an attribute of the
element.
• A nonnumeric label or numeric code may be used.
• Example:- Students of a university are classified by the dorm
that they live in using a nonnumeric label such as Farley,
Keenan, Zahm, Breen-Phillips, and so on.

Ordinal
• The data have the properties of nominal data and the order or
rank of the data is meaningful
• A nonnumeric label or numeric code may be used.
• Example :- Students of a university are classified by their class
standing using a nonnumeric label such as Freshman,
Sophomore, Junior, or Senior.

Interval
• The data have the properties of ordinal data, and the interval
between observations is expressed in terms of a fixed unit of
measure.
• Interval data are always numeric.
• Example :- Average Starting Salary Offer 2003
• Economics/Finance: $40,084
• History: $32,108
• Psychology: $27,454

Ratio
• The data have all the properties of interval data and the ratio of
two values is meaningful.
• Variables such as distance, height, weight, and time use the
ratio scale.
• This scale must contain a zero value that indicates that nothing
exists for the variable at the zero point.
• Example :- Econ & Finance majors' salaries are about 1.25 times
History majors' salaries and about 1.46 times
Psychology majors' salaries

Normal distribution and z-scores
• The normal distribution is a continuous probability
distribution that is symmetric about its mean.
• The normal distribution is often called the bell curve because
the graph of its probability density looks like a bell. It is also
known as the Gaussian distribution.
• A normal distribution is determined by two parameters: the
mean and the variance. A normal distribution with mean 0
and standard deviation 1 is called the standard normal distribution.

Z-scores
• The z-score, or standard score, expresses how many
standard deviations a value lies from the mean
• A z-score consists of two parts
• a) a positive or negative sign indicating whether the value is above or below the
mean
• b) a number indicating the size of its deviation from the mean in SD
units

Why are z-scores important?
• It is useful to standardize the values of a normal distribution
by converting them into z-scores
• Using the z-score technique, one can compare two different
test results based on relative performance rather than on each test's own
scale
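A short sketch of the comparison described above, using the standard formula z = (x - mean) / SD. The two tests, their means, and their standard deviations are made-up numbers, and `z_score` is a hypothetical helper name.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Hypothetical results: the raw scores are on different scales.
math_z = z_score(82, mean=70, std_dev=8)
physics_z = z_score(65, mean=50, std_dev=12)

# The sign says above or below the mean; the size is the distance in SD units.
print(f"math z = {math_z:+.2f}, physics z = {physics_z:+.2f}")
better = "math" if math_z > physics_z else "physics"
print(f"Relative to each group, the {better} result is stronger.")
```

Once both scores are expressed in SD units they share one scale, so they can be compared directly.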

Correlation
• Correlation refers to a relationship between two or more
objects. In statistics, the word correlation refers to the
relationship between two variables
• Covariance is the extent to which a change in one variable
corresponds systematically to a change in the other

Types of correlation
• Positive and negative
• Simple and multiple
• Partial and total
• Linear and non-linear

• Positive correlation:- Association between variables such that high
scores on one variable tend to go with high scores on the other
variable. A direct relation between the variables.
• Negative correlation:- Association between variables such that high
scores on one variable tend to go with low scores on the other
variable. An inverse relation between the variables.

Simple and multiple
• Simple:- when only two variables are studied, the
relationship is described as a simple correlation
• Example:- quantity of money
• Multiple :- it is about the study of more than two variables
simultaneously , the relationship is described as a multiple
correlation
• Example:- the relationships of price

Partial and total correlation
• Partial correlation :- the analysis recognizes more than two
variables but considers only two of them, keeping the others
constant
• Total correlation :- the analysis is based on all the
relevant variables, which is normally not feasible. In total
correlation, all the facts are taken into account

Linear and non-linear correlation
• Linear correlation : - correlation is said to be linear when the amount
of change in one variable tends to bear a constant ratio to the
amount of change in the other
• Non linear correlation :- correlation is said to be non linear if the
amount of change in one variable does not bear a constant ratio to
the amount of change in the other

Classification of correlation
• Two methods are used for finding relationship between variables:-
• Graphic methods
• Mathematical methods
• Graphic methods are further divided into scatter diagram and simple
graph
• Types of mathematical methods
• Karl Pearson's Coefficient of Correlation
• Spearman's Rank Correlation Coefficient
• Coefficient of concurrent deviation
• Method of least squares

Coefficient of correlation
• Correlation: the degree of the relationship between the
variables under consideration is measured through
correlation analysis
• The measure of correlation is called the correlation coefficient
• If two variables vary so that movements in one are
accompanied by movements in the other, the variables are said to be correlated. This alone does not
establish a cause-and-effect relationship
• The degree of the relationship is expressed by r, where -1 <= r <= +1

Properties of correlation
• Correlation requires that both variables be
quantitative
• The correlation coefficient is always between -1 and
+1
• The correlation coefficient is a pure number without
units
• The correlation can be misleading in the presence of
outliers or non linear associations
• Correlation measures association

Scatter plots
• When two variables x and y have an association (or
relationship), we say there exists a correlation between them.
Alternatively, we could say x and y are correlated
• One variable is called independent (X) and the other is called
dependent (Y)
• A scatterplot is a graph in which the paired (x, y) sample data are
plotted with a horizontal x axis and a vertical y axis

Advantages and Disadvantages of scatter
diagram
• Advantages :-
• It is a simple and attractive method for finding
the nature of correlation
• It is easy to understand
• The user gets a rough idea of the correlation
• It is not influenced by the size of extreme items
• It is the first step in investigating the relationship between two variables
• Disadvantages:-
• It cannot establish the exact degree of correlation

Correlation coefficient for quantitative data
• The product moment correlation, r, summarizes the strength of
association between two metric variables X and Y
• It is an index used to determine whether a linear (straight-line) relationship
exists between X and Y
• It measures the nature and strength of the association between two
quantitative variables
• The sign of r denotes the nature of the association, while the magnitude of r
denotes its strength
• The value of r ranges between -1 and +1

• If r = 0 -> no correlation
• If 0 < |r| < 0.25 -> weak correlation
• If 0.25 <= |r| < 0.75 -> intermediate correlation
• If 0.75 <= |r| < 1 -> strong correlation
• If |r| = 1 -> perfect correlation
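The product moment correlation and the strength bands above can be sketched as follows. The formula is the standard Pearson definition. Applying the bands to the absolute value of r, so that negative correlations are graded too, is an assumption of this sketch, and the data and function names are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Grade the size of r using the bands listed above (applied to |r|)."""
    size = abs(r)
    if size == 0:
        return "no correlation"
    if size < 0.25:
        return "weak"
    if size < 0.75:
        return "intermediate"
    if size < 1:
        return "strong"
    return "perfect"


# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

r = pearson_r(hours, marks)
print(f"r = {r:.3f} ({strength(r)})")
```

With floating-point arithmetic, an exactly linear data set may give an r very slightly below 1, so the "perfect" band is best treated as approximate in practice.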

Regression
• Given an input X, if the output is continuous, the task is called a regression problem.
• Regression is concerned with the prediction of continuous quantities.
Linear regression is one of the oldest and most widely used predictive models in
the field of machine learning
• The goal of linear regression is to fit a straight line to a set of data points
by minimizing the sum of the squared errors
• It is a supervised learning algorithm. A regression model
requires knowledge of both the dependent and the independent
variables in the training data set
• Simple linear regression is a statistical model in which there is only
one independent variable, and the functional relationship between the
dependent variable and the regression coefficients is linear
• The regression line is the line that gives the best estimate of one variable for a given value of the other

Regression line of Y on X
• Y = a + bX
• Where
• a -> Y-intercept
• b -> slope of the line
• Y -> dependent variable
• X -> independent variable
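The intercept a and slope b can be estimated from data with the standard least-squares formulas, b = Sxy / Sxx and a = mean(Y) - b * mean(X). The sketch below is one way to do it. The data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope of the line
    a = mean_y - b * mean_x  # Y-intercept
    return a, b


# Hypothetical data: X is the independent variable, Y the dependent variable.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} * X")

# Predict Y for a new value of X.
predicted = a + b * 6
print(f"Predicted Y at X = 6: {predicted:.2f}")
```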

Regression line
• A way of making a somewhat precise prediction based
upon the relationship between two variables
• The regression line is placed so that it minimizes the
predictive error
• A negative residual indicates that the model is over-predicting
• A positive residual indicates that the model is under-predicting
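To illustrate the sign convention, the sketch below computes residuals as observed minus predicted values. The data and the line Y = 2.2 + 0.6 X are assumptions made up for this example.

```python
# Hypothetical data and a hypothetical fitted line Y = a + b * X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Residual = observed value - predicted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]


def describe(residual):
    """Label a residual according to the sign convention above."""
    if residual < 0:
        return "over-predicting"   # the prediction is above the observed value
    if residual > 0:
        return "under-predicting"  # the prediction is below the observed value
    return "exact"


labels = [describe(r) for r in residuals]
for x, r, label in zip(xs, residuals, labels):
    print(f"X = {x}: residual = {r:+.2f} ({label})")
```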

Linear regression
• The simplest form of regression to visualize is linear
regression with a single predictor. A linear regression
technique can be used if the relationship between X and Y
can be approximated with a straight line

Non linear regression
• Non linear regression is used when the relationship cannot be approximated with a
straight line
• The X and Y have a nonlinear relationship
• There are two important shortcomings of linear regression
• 1) predictive ability :- the linear regression fit often has low bias but
high variance
• 2) interpretative ability :- linear regression freely assigns a coefficient
to each predictor variable

Least Squares Regression Line
• The method of least squares estimates parameters by
minimizing the squared discrepancies between the observed data and the values predicted by the model
• The least squares (LS) criterion states that the sum of the squares of
the errors is a minimum. When least squares is used for classification, the solution yields outputs y(x) whose
elements sum to 1, but it does not ensure that the outputs lie in the
range [0, 1]
• The process of obtaining parameter estimators is called estimation.
The least squares method is the estimation method of Ordinary Least
Squares (OLS).
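The least squares criterion can be checked numerically: the fitted line should have a smaller sum of squared errors than any nearby line. This sketch assumes a made-up data set whose least-squares line is Y = 2.2 + 0.6 X, and the alternative lines are arbitrary choices for comparison.

```python
def sum_squared_errors(a, b, xs, ys):
    """Sum of squared differences between observed ys and the line a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares estimates for this data set.
best_a, best_b = 2.2, 0.6
best_sse = sum_squared_errors(best_a, best_b, xs, ys)

# Arbitrary nearby lines, each given as (intercept, slope).
alternatives = [(2.2, 0.7), (2.2, 0.5), (2.0, 0.6), (2.4, 0.6)]
other_sses = [sum_squared_errors(a, b, xs, ys) for a, b in alternatives]

print(f"SSE of least-squares line: {best_sse:.2f}")
for (a, b), sse in zip(alternatives, other_sses):
    print(f"SSE of Y = {a} + {b} * X: {sse:.2f}")
```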

Disadvantages of least square
• Lacks robustness to outliers
• Certain datasets are unsuitable for least squares classification
• Decision boundary corresponds to ML solution

Interpretation of R(square)
• The following measures are used to validate the simple linear
regression models:
• Coefficient of determination (R-square)
• Outliers analysis
• Residual analysis to validate the regression model
• Hypothesis test for the regression coefficient b₁

Characteristics of R-square
• R-square is a proportion, so it is always a number between 0 and 1
• R-square = 1 -> all of the data points fall perfectly on the regression
line. The predictor x accounts for all of the variation in y.
• The coefficient of determination, R-square, is a measure that assesses the
ability of a model to predict or explain the outcome in the linear regression setting.
• A higher R-square indicates a better fit of the model
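R-square can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that standard definition with made-up data and an assumed fitted line Y = 2.2 + 0.6 X. `r_squared` is a hypothetical helper name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical data and predictions from an assumed line Y = 2.2 + 0.6 * X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predictions = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(ys, predictions)
print(f"R-square = {r2:.2f}")

# When every point lies on the line, R-square equals 1.
perfect = r_squared([2, 4, 6], [2, 4, 6])
print(f"R-square for a perfect fit = {perfect}")
```

For simple linear regression fitted by least squares, this value equals the square of the correlation coefficient r.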

Spurious regression
• A regression is spurious when we regress one random walk onto
an independent random walk
• The coefficient estimate will not converge toward 0
• The t value is most often significant.
• R-square is typically very high
• Spurious regression is linked to serially correlated errors

Hypothesis test for regression coefficient (t-test)
• The regression coefficient captures the existence of a linear
relationship between the response variable and the explanatory
variable
• Using the analysis of variance (ANOVA), we can test whether the overall
model is statistically significant

Residual analysis
• Residual (error) analysis is important for checking whether the
assumptions of the regression model have been satisfied. It checks whether:
• The residuals are normally distributed
• There are any outliers
• The functional form of the regression is correctly specified
• The variance of the residuals is constant

Multiple regression equation
• Multiple linear regression is an extension of linear regression that
allows a response variable, y, to be modelled as
a linear function of two or more predictor variables
• In a multiple regression model, two or more independent variables
(predictors) are involved. Both the simple linear regression
model and the multiple regression model assume that the dependent
variable is continuous
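A multiple regression with two predictors can be sketched in plain Python by solving the least-squares normal equations. This is only an illustration under stated assumptions: the data are made up so that y = 1 + 2*x1 + 3*x2 holds exactly, and `solve` and `fit_multiple` are hypothetical helper names. In practice a numerical library would normally be used.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk]: the intercept and one coefficient per predictor."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9, 8, 19, 18, 26]

coefficients = fit_multiple(rows, ys)
print([round(c, 6) for c in coefficients])
```

The result has one regression coefficient for each independent variable, plus the intercept.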

Difference between Simple and Multiple
regression
Simple regression | Multiple regression
One dependent variable Y predicted from one independent variable X | One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xn)
One regression coefficient | One regression coefficient for each independent variable