SlideShare a Scribd company logo
1 of 125
Download to read offline
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Data Science using R, Minitab & XLMiner
R, Minitab XLMiner for Forecasting
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
PMP
PMI-ACP
PMI-RMP
CSM
LSSGB
Project Management Professional
Agile Certified Practitioner
Risk Management Professional
Certified Scrum Master
Lean Six Sigma Green Belt
LSSBB
SSMBB
ITIL
Lean Six Sigma Black Belt
Six Sigma Master Black Belt
Information Technology Infrastructure Library
Agile PM Dynamic System Development Methodology Atern
Name: Bharani Kumar
Education: IIT Hyderabad
Indian School of Business
Professional certifications:
My Introduction
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
HSBC
Driven using UK policies
ITC Infotech
Driven using Indian policies SME
Infosys
Driven using Indian policies under Large enterprises
Deloitte
Driven using US policies
1
2
3
4
My Introduction
RESEARCH in
ANALYTICS, DEEP
LEARNING & IOT
DATA SCIENTIST
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Tuckman Model
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
AGENDA
Data
Visualization
using Tableau
Data Mining –
Supervised &
Unsupervised
(Machine
Learning)
Text Mining &
NLP
AGENDA
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
What does it take to be a DATA SCIENTIST?
Successful Data Scientist
All Agenda
Topics
Domain
Knowledge
Practice
Statistical
Analysis
Data
Minin
g
Forecasting
Data
Visualizatio
n
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Welcome to the Information Age …
… drowning in data and starving for Knowledge
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
500 million tweets every day, 1.3 billion accounts
YouTube users upload 100 hours of video every minute
306 items are purchased every second
26.6 Million transactions per day
100 terabytes of data uploaded daily
http://www.dnaindia.com/scitech/report-facebook-saw-
one-billion-simultaneous-users-on-aug-24-2119428
Processing 100 petabytes a day (1 petabyte
= 1000 terabytes)
More than 1 million customer transactions every hour
BIG DATA!
https://www.techinasia.com/alibaba-crushes-records-brings-143-billion-singles-day
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Why Tableau?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Why Tableau?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Why Tableau?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Why Tableau?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1
2
3
4
5
Data Types – Continuous, Discrete, Nominal, Ordinal, Interval,
Ratio, Random Variable, Probability, Probability Distribution
First, second, third & fourth moment business decisions
Graphical representation – Barplot, Histogram, Boxplot, Scatter
diagram
Simple Linear Regression
Hypothesis Testing
Agenda – Basic Statistics
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Data Types – Continuous & Discrete
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Data Types – Preliminaries
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Random Variable
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distribution
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Applications
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Population
Sampling Frame
SRS
Sample
Sampling Funnel
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Central Tendency Population Sample
Mean / Average
Median Middle value of the data
Mode Most occurring value in the data
Measures of Central Tendency
“Every American should have above average income, and my
Administration is going to see they get it.” – American President
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Measures of Dispersion
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Dispersion Population Sample
Variance
Standard Deviation
Range Max – Min
Measures of Dispersion
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
 For a probability distribution, the mean of the distribution is known as the expected
value
 The expected value intuitively refers to what one would find if they repeated the
experiment an infinite number of times and took the average of all of the outcomes
 Mathematically, it is calculated as the weighted average of each possible value
Expected Value
The formula for calculating the
expected value for a discrete random
variable X, denoted by μ, is:
The variance of a discrete random
variable X, denoted by σ2 is
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Graphical Techniques – Bar Chart
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Graphical Techniques – Histogram
A Histogram Represents the frequency distribution, i.e., how many observations
take the value within a certain interval.
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Skewness & Kurtosis
• A measure of asymmetry in the distribution
• Mathematically it is given by
E[(x-µ/σ)]3
• Negative skewness implies mass of the
distribution is concentrated on the
right
Third and Fourth moments
Skewness Kurtosis
• A measure of the “Peakedness” of
the distribution
• Mathematically it is given by
E[(x-µ/σ)]4 -3
• For Symmetric distributions, negative
kurtosis implies wider peak and thinner
tails
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Graphical Techniques – Box Plot
Box Plot : This graph shows the distribution of data by dividing
the data into four groups with the same number of data points
in each group. The box contains the middle 50% of the data
points and each of the two whiskers contain 25% of the data
points. It displays two common measures of the variability or
spread in a data set
Range : It is represented on a box plot by the
distance between the smallest value and the largest
value, including any outliers. If you ignore outliers,
the range is illustrated by the distance between the
opposite ends of the whiskers
Range(IQR): The
middle half of a data
set falls within the
inter- quartile range
Inter-
quartile
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Normal Distribution
 The Probability associated with any single value of a random variable is always zero
 Area under the entire curve is always equal to 1
 The normal random variable takes values from -∞ to +∞
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Normal Distribution
Characterized by
a bell shaped
curve
Has the following
properties:
68.26% of values
lie within ±1 σ
from the mean
95.46% of the
values lie within
±2 σ from the
mean
99.73% of the
values lie within ±
3σ from the
mean
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Normal Distribution
Characterized by
mean, µ, and
standard deviation, σ
X~N(µ,σ)
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Z scores, Standard Normal Distribution
• For every value (x) of the random variable X, we can calculate Z score:
• Interpretation − How many standard deviations away is the value from the mean ?
Z =
X−µ
σ
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Calculating Probability from Z distribution
Suppose GMAT scores can be reasonably modelled using a normal distribution
− µ = 711 σ = 29
What is p(x ≤ 680)?
Step 1: Calculate Z score corresponding to 680
- Z = (680-711)/29 = -1.06
Step 2: Calculate the probabilities using Z – Tables
- P(Z ≤ -1) = 0.14
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Calculating Probability from Z distribution
• What is P( 697 ≤ X ≤ 740) ?
• Step 1 : Use P(x1 ≤ X ≤ x2) = Use P( X ≤ x2) − P( X ≤ x1)
• Step 2 : Calculate P( X ≤ x2) and P( X ≤ x1) as before
P( X ≤ 740) = P( Z ≤ 1) = 0.84 ; P( X ≤ 697) = P( Z ≤ - 0.5) = 0.31
• Step 3 : Calculate P( 697 ≤ X ≤ 740 ) = 0.84 – 0.31 = 0.53
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Normal Quantile (Q-Q) Plot
SampleQuantiles
Theoretical Quantiles
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Sampling variation
 Sample mean can be (and most likely is) different from the population mean
 Sample mean varies from one sample to another
 Sample mean is a random variable
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Central Limit Theorem
The standard error of the mean estimates the variability between samples whereas the
standard deviation measures the variability within a single sample
The Distribution of the sample mean
- will be normal when the distribution of data in the population is normal
- will be approximately normal even if the distribution of data in the population is not normal
if the “sample size” is fairly large
Mean ( X ) = µ ( the same as the population mean of the raw data)
Standard Deviation (X) = , where σ is the population standard deviation and n is the sample size
- This is referred to as standard error of mean
_
σ
√𝑛
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Sample Size Calculation
A Sample Size of 30 is considered large enough, but that may /may not be adequate
More Precise conditions
- n > 10( K3 )2 , where ( K3 ) is sample skewness and
- n > 10( K4 ) , where ( K4) is sample kurtosis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Confidence Interval
• What is the Probability of tomorrow’s temperature being 42 degrees ?
Probability is ‘0’
• Can it be between [-50⁰C & 100⁰C] ?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Case Study: Confidence Interval
• A University with 100,000 alumni is thinking of offering a
new affinity credit card to its alumni.
• Profitability of the card depends on the average balance
maintained by the card holders.
• A Market research campaign is launched, in which about
140 alumni accept the card in a pilot launch.
• Average balance maintained by these is $1990 and the
standard deviation is $2833. Assume that the population
standard deviation is $2500 from previous launches.
• What we can say about the average balance that will be
held after a full−fledged market launch ?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Interval estimates of parameters
• Based on sample data
− The point estimate for mean balance = $1990
− Can we trust this estimate ?
• What do you think will happen if we took another random sample of 140 alumni ?
• Because of this uncertainty, we prefer to provide the estimate as an interval (range)
and associate a level of confidence with it
Interval
Estimate = Point Estimate ± Margin of Error
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Confidence Interval for the Population Mean
Start by choosing a confidence level (1-α) % (e.g. 95%, 99%, 90%)
Then, the population mean will be with in
X ± Z1-ᾳ where Z1-ᾳ satisfies p( -Z1-ᾳ ≤ Z ≤ Z1-ᾳ) = 1-ᾳσ
√𝑛
Margin of error depends on the underlying uncertainty, confidence level and sample size
_
Interval
Estimate =
Point Estimate ± Margin of Error
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Calculate Z value - 90%, 95% & 99%
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Confidence Interval Calculation
• Based on the survey and past data
• Construct a 95% confidence interval for the mean card balance and interpret it ?
• Construct a 90% confidence interval for the mean card balance and interpret it ?
− n = 140; σ = $2500; X = $ 1990
− σ X
_
- σ
√𝑛
= = 2500
√140
= 211.29
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Confidence Interval Interpretation
Consider the 95% Confidence interval for the mean income : [$1576, $2404]
Does this mean that
- The mean balance of the population lies in the range ?
- The mean balance is in this range 95% of the time ?
- 95% of the alumni have balance in this range ?
Interpretation 1 : Mean of the population has a 95% chance of being in this range for a random
sample
Interpretation 2 : Mean of the population will be in this range for 95% of the random samples
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
What if we don’t know Sigma?
• Suppose that the alumni of this university are very different and hence population standard
deviation from previous launches can not be used
We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample:
Calculate
• If the underlying population is normally distributed , T is a random variable distributed
according to a t-distribution with n-1 degrees of freedom Tn-1
• Research has shown that the t-distribution is fairly robust to deviation of the population
of the normal model
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Student’s t-distribution
As n ꝏ
tn N(0,1)
i.e., as the degrees
of the freedom
increase, the
t-distribution
approaches the
standard
normal distribution
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Confidence Interval for mean with unknown Sigma
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Calculating t-value
• Construct a 95% confidence interval for the mean card balance and interpret it?
− n = 140; σ = $2500; X = $ 1990
− σ X
_
- = 2833
√140
= 239.46
Then the 95% confidence interval for balance is [$1516, $2464]
Calculate t0.95, 139 = 1.98
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Right Decision
Confidence
Type II error
Right Decision
Power
Type I error
Ho is TRUE H1 is TRUE
Fail to
Reject Ho
Reject Ho
Hypothesis Testing
1-α
1-β
Start with Hypothesis about a
Population Parameter
Collect Sample Information
Reject/Do Not Reject Hypothesis
The factors that affect the power of a test include sample size, effect size, population variability, and 𝛼.
Power and 𝛼 are related as increasing 𝛼 decreases 𝛽. Since power is calculated by 1 minus 𝛽, if you
increase 𝛼,You also increase the power of a test. The maximum power a test can have is 1, whereas
the minimum value is 0.
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Our quality will not improve
after the consulting project
We will acquire 8,000 new
customers if I open a store in
this area
We will need 400 more
person hours to finish this
project
The retail market will grow by
50% in the next 5 years
Our potential customers do
not spend more than 60
minutes on the web every day
Less than 5% clients will default
on their loans
Hypothesis Testing
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Hypothesis Testing
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Hypothesis Testing
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample Z test
The length of 25 samples of a fabric are taken at random. Mean
and standard deviation from the historic 2 years study are 150 and
4 respectively. Test if the current mean is greater than the historic
mean. Assume α to be 0.05
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Population Standard
Deviation Known or Not
1 Sample Z Test
Stat > Basic Statistics >
1 Sample Z
2 3
Fabric
Data
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample Z test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Y: Fabric Length is
continuous
X: Discrete 1 Population
We are comparing mean
with external standard
of 150mm
Data was
shown to
be normal
Population
standard
deviation is
known=4
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample t Test
The mean diameter of the bolt manufactured should be 10mm to
be able to fit into the nut. 20 samples are taken at random from
production line by a quality inspector. Conduct a test to check with
95% confidence that the mean is not different from the
specification value.
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Population Standard
Deviation Known or Not
1 Sample t Test
Stat > Basic Statistics >
1 Sample t
2 3
Bolt
Diameter
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample t Test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Y: Bolt Diameter is continuous
X: Discrete 1 Population
We are comparing mean with
external standard of 10mm
Data was
given to be
Normal
Population
standard
deviation is
NOT known
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample Sign Test
The scores of 20 students for the statistics exam are provided. Test
if the current median is not equal to historic median of 82. Assume
‘’ to be 0.05
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
1 Sample Sign Test
Stat > Non Parametric >
1 Sample sign
3
Student
Scores
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Sample Sign Test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
2-Sample t Test
A financial analyst at a Financial institute wants to evaluate a
recent credit card promotion. After this promotion, 450
cardholders were randomly selected. Half received an ad
promoting a full waiver of interest rate on purchases made over
the next three months, and half received a standard Christmas
advertisement. Did the ad promoting full interest rate waiver,
increase purchases?
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Variance Test
Stat > Basic Statistics >
2 Variance
2 Sample t Test
Stat > Basic Statistics >
2-Sample t
2 3
Marketing
Strategy
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
2-Sample t Test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Hypothesis Testing
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Paired T Test
• This test is used to compare the means of two sets of observations when all the
other external conditions are the same
• This is a more powerful test as the variability in the observations is due to
differences between the people or objects sampled is factored out
Example: To find out if medication A lowers blood pressure
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Trigger your thoughts!
Comparing the performance of machine A vs.
machine B by feeding different raw materials to
each machine
Compare the performance of machine A vs.
machine B when the same raw material is fed to
each machine
Compare the power output of a wind mill when you
use motor A for 1 month and motor B for 1 month
Compare the power output of two wind mills next
to each other simultaneously when you use
motor A on one wind mill and motor B on another
Identifying resistor defects and capacitor defects
in same PCB by collecting such data using 20 PCB
units
Identifying resister defects on 20 PCB’s and
capacitor defects on 20 (different) PCB’s
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
2-Sample t test or Paired T test
Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10
vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles,
additive to be tested is added with the fuel and their mileage is recorded. Find if the
mileage increases by adding the fuel additive.
Assume the same data was recorded if only 10 vehicles were chosen and mileage
was recorded before and after adding the additive. What method will you choose to
find the result.
2-Sample t test
Paired T test
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Mann-Whitney test
Effect of fuel additive on vehicles is being studied. Out of a total
of 20 vehicles, 10 vehicles are chosen randomly and mileage is
recorded. In rest of the 10 vehicles, additive to be tested is added
with the fuel and their mileage is recorded. Find if the mileage
increases by adding the fuel additive.
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Mann – Whitney test
for Medians
Stat > Non Parametric >
Mann Whitney
2
Vehicle with
& without
Additives
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Mann-Whitney Test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Paired T test
Effect of fuel additive on vehicles is being studied. Out of a total
of 20 vehicles, 10 vehicles are chosen randomly and mileage is
recorded. In rest of the 10 vehicles, additive to be tested is added
with the fuel and their mileage is recorded. Find if the mileage
increases by adding the fuel additive. Assume the same data was
recorded if only 10 vehicles were chosen and mileage was
recorded before and after adding the additive.
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Paired T Test
Stat > Basic Statistic >
Paired T
2
Vehicle with
& without
Additives
• Since the data was not
normal, the cause of
non-normality was
investigated and it was
found that the first data
point for “with additive”
was wrongly entered.
This value should have
been 20. Now, proceed
with the rest of the
analysis.
• If the data were truly
non-normal our analysis
would stop here.
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Paired T test – Write Hypothesis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
One-Way ANOVA
A marketing organization outsources their back-office operations
to three different suppliers. The contracts are up for renewal and
the CMO wants to determine whether they should renew
contracts with all suppliers or any specific supplier. CMO want to
renew the contract of supplier with the least transaction time.
CMO will renew all contracts if the performance of all suppliers is
similar
Normality Test
Stat > Basic Statistics >
Graphical Summary
1
Variance Test
Stat > ANOVA >
Test for Equal Variances
ANOVA
Stat > ANOVA >
One-Way….
2 3
Contract
Renewal
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Example : More weight reduction programs
• She randomly assigns equal number of participants to each of these programs from
a common pool of volunteers
• Suppose the nutrition expert would like to do a comparative evaluation of three diet
programs(Atkins, South Beach, GM)
• Suppose the average weight losses in each of the groups(arms) of the experiments
are 4.5kg, 7kg, 5.3kg
• What can she conclude?
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Two kinds of variation matter
• Not every individual in each program will respond identically to the diet program
• Easier to identify variations across programs if variations within programs are smaller
• Hence the method is called Analysis of Variance(ANOVA)
• With-in group variation = Experimental Error
• Between group variation
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
• It should be obvious that for every observation : Totij = ti + eij
• What is more surprising and useful is:
Formalizing the intuition behind variations
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Statistically test for equality means
• n subjects equally divided into r groups
• Hypothesis
- H0: μ1 = μ2 = μ3 = … = μr
- Not all μi are equal
• Calculate
- Mean Square Treatment MSTR = SSTR / (r‐1)
- Mean Square Error MSE = SSE / (n‐r)
- The ratio of two squares f = MSTR/MSE = Between group variation/Within group variation
- Strength of this evidence p‐value = Pr(F(r‐1,n‐r) ≥ f)
• Reject the null hypothesis if p‐value < α
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Analysis of variance(ANOVA)
• ANOVA can be used to test equality of means when there are
more then 2 populations
• ANOVA can be used with one or two factors
• If only one factor is varying, then we would use a one-way
ANOVA
– Example: We are interested in comparing the mean performance of several departments within a
company. Here the only factor is the name of department
– If there are two factors, we would use a two way ANOVA. Example: One factor is department and the
second factor is the shift.(day vs. Night)
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Analysis of variance(ANOVA)
Source of Variation Sum of Squares (SS) Degrees of Freedom Mean Square (MS) F Test Statistic
Between Treatments SSFactor K-1 MSFactor = SSFactor /
DFFactor
F = MSFactor /
MSError
Within Treatment SSError N-k MSError = SSError /
DFError
Total SSTotal N-1
Source of Variation Sum of Squares (SS) Degrees of Freedom Mean Square (MS) F Test Statistic
Factor A SSA nA - 1 MSA = SSA / (nA – 1) FA = MSA / MSE
Factor B SSB nB - 1 MSB = SSB / (nB – 1) FB = MSB / MSE
Interaction A * B SSAB (nA – 1) (nB – 1) MSAB = SSAB / (nAB – 1) FAB = MSAB / MSE
Error SSE n – nA * nB MSE = SSE / (n – nA * nB)
Total SST n - 1
One Way ANOVA
Two Way ANOVA
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Is the Transaction time dependent
on whether person A or B processes
thetransaction?
Is medicine 1 effective or medicine 2
at reducing heart stroke?
Is the new branding program more
effective in increasing profits?
Does the productivity of employees vary
depending on the three levels?
(Beginner, Intermediate and Advanced)
Three different sale closing methods
were used. Which one is most effective?
Four types of machines are used. Is
weight of the Rugby ball dependent on
the type ofmachine used?
2 Sample t-test ANOVA – One Way
Dichotomies
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Non-Parametric equivalent to ANOVA
• When the data are not normal or if the data points are very few to figure out
if the data are normal and we have more than 2 populations, we can use the
Mood’s Median or Kruskal Wallis test to compare the populations
Ho : All the medians are the same
Ha : One of the medians is different
• Mood’s median assigns the data from each population that is higher than
the overall median to one group, and all points that are equal or lower to
another group. It then uses a Chi-Square test to check if the observed
frequencies are close to expected frequencies
• Kruskal Wallis is another test that is non-parametric equivalent of ANOVA.
Kruskal Wallis is the extension of Mann-Whitney test
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Mood’s Median & Kruskal Wallis
Growth is measured for three treatments as shown in the case
study. Compare the effect of the three treatments on growth.
Mood’s Median – handles outliers well
Stat > Nonparametric > Mood’s Median
1
Kruskal Wallis – more powerful than Mood’s Median
Stat > Nonparametric > Kruskal Wallis
2
Height
Growth
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Hypothesis Testing
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
1-Proportion Test
• A poll is carried out to find the acceptability of
new football coach by the people. It was
decided that if the support rate for the coach
for the entire population was truly less then
25%, the coach would be fired
• 2000 people participated and 482 people
supported the new coach
• Conduct a test to check if the new coach should
be fired with 95% level of confidence
Football
Coach
1-Proportion Test
Stat > Basic Statistics > 1-Proportion
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
2-Proportion Test
Johnnie Talkers soft drinks division sales manager has been planning
to launch a new sales incentive program for their sales executives.
The sales executives felt that adults (>40 yrs) won’t buy, children will
& hence requested sales manager not to launch the program.
Analyze the data & determine whether there is evidence at 5%
significance level to support the hypothesis
Johnnie
Talkers
Proportion A = Proportion B Check p-valueHo
Proportion A NOT = Proportion B
If p-value < alpha,
we reject Ho
Ha
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Test
How can you determine whether the distribution of defects in your product or service has changed
from the historic distribution over time, or exceeds an industry standard
• Do you think mean is more significant or variance?
Comparing population’s variance to a standard value involves calculating the
chi-square test statistic
We can also:
Determine whether one variable is dependent over another
Comparing observed & expected frequencies where variance is unknown.
This is called as goodness-of-fit test
Compare multiple proportions
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Goodness-of-fit test
Goodness-of-fit test is to test assumptions about the distributions that fit the
process data
Are observed frequencies (O) same or different from historical, expected or
theoretical frequencies (E)?
If there’s a difference between them, this suggests that the distribution
model expressed by the expected frequencies does not fit the data
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Test
• A city has a newly opened nuclear plant, and there are families staying
dangerously close to the plant. A health safety officer wants to take this case
up to provide relocation for the families that live in the surrounding area. To
make a strong case, he wants to prove with numbers that an exposure to
radiation levels is leading to an increase in diseased population. He
formulates a contingency table of exposure and disease.
• Does the data suggest an association between the disease and exposure?
Disease Total
Exposure Yes No
Yes 37 13 50
No 17 53 70
Total 54 66 120
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Test
Calculate the number of individuals of exposed and unexposed groups
expected in each disease category (yes and no) if the probabilities were
the same
If there were no effect of exposure, the probabilities should be same and
the chi-squared statistic would have a very low value.
Proportion of population exposed = (50/120) = 0.42
Proportion of population not exposed = (70/120) = 0.58
Thus, expected values:
Population with disease = 54
Exposure Yes : 54 * 0.42 = 22.5
Exposure No : 54 * 0.58 = 31.5
Population without disease = 66
Exposure Yes : 66 * 0.42 = 27.5
Exposure No : 66 * 0.58 = 38.5
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Test
• Calculate the Chi-squared statistic
χ2 = Σ =
= 29.1
• Calculate the degrees of freedom :
(Number of rows – 1) X (Number of columns – 1)
df = (2 – 1) X (2 – 1) = 1
• Calculate the p-value from the Chi-squared table
For chi-squared value 29.1 and degrees of freedom = 1, from the table, p-value is < 0.001
• Interpretation: There is 0.001 chance of obtaining such discrepancies between expected and
observed values if there is no association
• Conclusion : There is an association between the exposure and disease
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Chi-Square Test
Bahamantech Research Company uses 4 regional centers in South Asia
(India, China, Srilanka and Bangladesh) to input data of questionnaire
responses. They audit a certain % of the questionnaire responses versus
data entry. Any error in data entry renders it defective. The chief data
scientist wants to check whether the defective % varies by country.
Analyze the data at 5% significance level and help the manager draw
appropriate inferences. [‘1’ means not defectives & ‘0’ means defective]
All proportions are equal Check p-valueHo
Not all proportions are equal
If p-value < alpha,
we reject Ho
Ha
Bahaman
Research
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Non-Parametric Tests
•Referred to as “distribution free”, as they don’t involve making
assumptions of any data
•They have lower power than the parametric tests and hence are
always given the second preference after the parametric tests
•These tests are typically focused on median rather than mean
•They involve straight-forward procedures like counting and ordering
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Thank You
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Lognormal:
• Fits many kinds of failure data
• Used for reliability analysis, cycles-to-failure, loading variables & fatigue stress
• Tensile strength of fibers & breaking strength of concrete
• Environment data such as random quantities of pollutants in water or air
• Economic variables such as per capita income
• Extreme values are well managed & makes data normal
• μ, σ are mean & standard deviation of natural logarithms
Data Log transformed
12 2.48
28 3.33
87 4.47
143 4.96
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Lognormal:
• This distribution is right skewed
• Skewness increases as value of σ increases
• Pdf starts at zero, increases to its mode, and then decreases
• If time-to-failure has a lognormal distribution, then the logarithm of time-to-failure has a normal
distirbution
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Exponential:
• Length of time between check-ins at a reception desk, calls at a call center, customers at a cashier
• Used when events occur continuously & independently at a constant average rate
• Used to model rate of change that will occur in a given amount of time
• How long equipment will keep working with proper maintenance & part replacement
• Use to model behavior of independent variables that have a constant rate
• The occurrences of variables are described by a Poisson distribution, but the times between occurrences
are described by Exponential distribution
• If X is Poisson distributed then Y = 1/X will be exponentially distributed
• # of arrivals at a checkout counter, # of product failures over time – Poisson
• Length of time between events, i.e., one arrival or failure & the next – Exponential distribution
• Exponential distribution can model the interval between random events
• λ = failure rate; θ = mean; x = random variable
• Used to model mean time between occurrences
• In exponential population, 37% of observations are below the mean &
63% are above
• Uses constant failure rate
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Weibull:
• Model failure rate; rate is not constant
• Model time to failure, time to repair & material strength
• When system/item ages & failure rate increases/decreases
• Can model different distributions due to having parameters of shape, scale & location
• Can simulate Lognormal, Exponential & many other distributions
• Use widely in reliability & statistical applications
• Weibull & Lognormal are from same family & both can be used to assess the dataset that
contains close to average values (not too high / low)
• However, Weibull is a better fit when majority of data falls to the higher side
• Lognormal is a better fit when majority of data falls to the lower side
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Weibull:
• β is shape parameter, also called as slope, determines the shape of the distribution
 When beta = 1, shape of distribution = exponential distribution
 When beta: 3 to 4, shape of distribution = normal distribution
 Several beta values can approximate lognormal distribution
• η is scaled parameter (eta), determines the spread or width of distribution
• γ is non-zero location parameter, is the point, below which there are no failures, changing the value will
move distribution to right or left
 Gamma > 0, there is a period when no failures occur
 Gamma < 0, failures have occurred before time equals zero
e.g., defective raw materials or failure during transportation
 When Gamma = 0, eta is called as characteristic life
• Regardless of specific value of beta, 63.2% of values fall below the characteristic life
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Bivariate Normal Distribution:
• Used when 2 variables that are normally distributed & may be totally independent or may be correlated to
some degree
• A joint distribution of two independent variables that simultaneously
& jointly cross-classifies the data
• Can be discrete or continuous
• 3D plot like mountain terrain
• X & Y axes represent independent variables
• Z axis shows either
 frequency for discrete data
 probability for continuous data
• The maximum or peak occurs when X1 = Mu1 & X2 = Mu2. You can take a
“slice” anywhere along the distribution by fixing one of the variables. This
is known as a conditional distribution
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Probability Distributions
Bivariate Normal Distribution:
• Can help determine items of critical importance:
• Causality – examine the joint frequencies to investigate if the second variable changes in a
systematic way when the first variable changes
• Predictions – reviewing outcomes from one variable as the other changes
• Importance – if two variables are causally related they should have a statistically significant impact
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Scatter Diagram
 Scatter diagrams or plots provides a graphical representation of the relationship of two
continuous variables
 Be Careful - Correlation does not guarantee causation. Correlation by itself does not
imply a cause and effect relationship!
 Judge strength of relationship by width or tightness of scatter
 Determine direction of the relationship, e.g. If X increases, and Y decreases, it is negative
correlation, similarly if X increases, and Y increases, it is positive correlation
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Correlation Analysis
 Correlation Analysis measures the degree of linear relationship between two variables
 Range of correlation coefficient -1 to +1
 Perfect positive relationship +1
 Perfect negative relationship -1
 No Linear relationship 0
 If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a
good relationship
• Example: r = 0.87, r = -0.9, r = 0.9, r = -0.87 describe good relationship
• Example: r = 0.5, r = -0.5, r = 0.28 describe poor relationship
 Correlation values of -1 or 1 imply an exact linear relationship. However, the real value of
correlation is in quantifying less than perfect relationships
 We can perform regression analysis, which attempts to further describe this type of
relationship, if the correlation is good between the 2 variables
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Correlation Analysis
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Linear Regression Model
The equation that represents how an independent variable is related to a dependent variable
and an error term is a regression model
y = β0 + β1x + ε
Where, β0 and β1 are called parameters of the model,
ε is a random variable called error term.
β0
β1
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Linear Regression Model
y intercept
Error term
An observed value of x
when x equals x0
Mean value of
y when x
equals x0
Straight line defined by the
equation y = β0 + β1x
X
Y
x0 = A specific value of x, the
independent variable.
β0
β1
Fitting a straight line by least squares
ˆY = ˆb0 + ˆb1X
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Regression Analysis
 R-squared-also known as Coefficient of determination, represents the % variation in
output (dependent variable) explained by input variables/s or Percentage of response
variable variation that is explained by its relationship with one or more predictor variables
 Higher the R^2, the better the model fits your data
 R^2 is always between 0 and 100%
 R squared is between 0.65 and 0.8 => Moderate correlation
 R squared in greater than 0.8 => Strong correlation
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Regression Analysis
 Prediction and Confidence Interval are types of confidence intervals used for
predictions in regression and other linear models
 Prediction Interval: Represents a range that a single new observation is likely to fall given
specified settings of the predictors
 Confidence interval of the prediction: Represents a range that the mean response is likely
to fall given specified settings of the predictors
 The prediction interval is always wider than the corresponding confidence interval because
of the added uncertainty involved in predicting a single response versus the mean response
© 2013 - 2016 ExcelR Solutions. All Rights Reserved
Regression Techniques – Simple Linear Regression
Y = Continuous
X = Single &
Continuous
Simple
Linear
Regression
Y = Continuous
X = Single &
Discrete
Simple
Linear
Regression
Create
Dummy
Variable
109 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Dummy Variable
Gender Dummy Variable
Male 1
Female 0
Male 1
Female 0
Male 1
Male 1
Female 0
Male 1
Male 1
Female 0
110 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – R
A business problem:
The Waist Circumference – Adipose Tissue data
• Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of
cardio-vascular diseases
• Computed Tomography, commonly called the CT Scan is the only technique that allows for the precise and
reliable measurement of the AT (at any site in the body)
• The problems with using the CT scan are:
• Many physicians do not have access to this technology
• Irradiation of the patient (suppresses the immune system)
• Expensive
• Is there a simpler yet reasonably accurate way to predict the AT area? i.e.,
• Easily available
• Risk free
• Inexpensive
• A group of researchers conducted a study with the aim of predicting abdominal AT area using simple
anthropometric measurements, i.e., measurements on the human body
• The Waist Circumference – Adipose Tissue data is a part of this study wherein the aim is to study how well waist
circumference (WC) predicts the AT area
111 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Data Set
Observation Waist AT Observation Waist AT Observation Waist AT
1 74.75 25.72 38 103 129 75 108 217
2 72.6 25.89 39 80 74.02 76 100 140
3 81.8 42.6 40 79 55.48 77 103 109
4 83.95 42.8 41 83.5 73.13 78 104 127
5 74.65 29.84 42 76 50.5 79 106 112
6 71.85 21.68 43 80.5 50.88 80 109 192
7 80.9 29.08 44 86.5 140 81 103.5 132
8 83.4 32.98 45 83 96.54 82 110 126
9 63.5 11.44 46 107.1 118 83 110 153
10 73.2 32.22 47 94.3 107 84 112 158
11 71.9 28.32 48 94.5 123 85 108.5 183
12 75 43.86 49 79.7 65.92 86 104 184
13 73.1 38.21 50 79.3 81.29 87 111 121
14 79 42.48 51 89.8 111 88 108.5 159
15 77 30.96 52 83.8 90.73 89 121 245
16 68.85 55.78 53 85.2 133 90 109 137
17 75.95 43.78 54 75.5 41.9 91 97.5 165
18 74.15 33.41 55 78.4 41.71 92 105.5 152
19 73.8 43.35 56 78.6 58.16 93 98 181
20 75.9 29.31 57 87.8 88.85 94 94.5 80.95
21 76.85 36.6 58 86.3 155 95 97 137
22 80.9 40.25 59 85.5 70.77 96 105 125
23 79.9 35.43 60 83.7 75.08 97 106 241
24 89.2 60.09 61 77.6 57.05 98 99 134
25 82 45.84 62 84.9 99.73 99 91 150
26 92 70.4 63 79.8 27.96 100 102.5 198
27 86.6 83.45 64 108.3 123 101 106 151
28 80.5 84.3 65 119.6 90.41 102 109.1 229
29 86 78.89 66 119.9 106 103 115 253
30 82.5 64.75 67 96.5 144 104 101 188
31 83.5 72.56 68 105.5 121 105 100.1 124
32 88.1 89.31 69 105 97.13 106 93.3 62.2
33 90.8 78.94 70 107 166 107 101.8 133
34 89.4 83.55 71 107 87.99 108 107.9 208
35 102 127 72 101 154 109 108.5 208
36 94.5 121 73 97 100
112 Footer Copyright © 2015 ExcelR . All rights reserved.
Simple Linear Regression – Transformation
reg <- lm(AT ~ Waist) # Linear Regression
summary(reg)
confint(reg, level=0.95)
predict(reg, interval="predict”)
reg_log <- lm(AT ~ log(Waist)) # Regression using Logarithmic Transformation
summary(reg_log)
confint(reg_log, level=0.95)
predict(reg, interval="predict”)
reg_exp <- lm(log(AT) ~ Waist) # Regression using Exponential Transformation
summary(reg_exp)
confint(reg_exp, level = 0.95)
predict(reg, interval="predict”)
113 Footer Copyright © 2015 ExcelR . All rights reserved.
Regression Techniques – Multiple Linear Regression
Y = Continuous
X = Multiple &
Continuous
Multiple
Linear
Regression
Y = Continuous
X = Multiple
& Discrete
Multiple
Linear
Regression
Create
Dummy
Variable
114 Footer Copyright © 2015 ExcelR . All rights reserved.
Multiple Linear Regression – Dummy Variable
Make of car
Dummy
Variable_Petrol
Dummy
Variable_Diesel
Dummy
Variable_CNG
Dummy
Variable_LPG
Petrol 1 0 0 0
Diesel 0 1 0 0
CNG 0 0 1 0
LPG 0 0 0 1
Diesel 0 1 0 0
CNG 0 0 1 0
Petrol 1 0 0 0
LPG 0 0 0 1
Petrol 1 0 0 0
LPG 0 0 0 1
115 Footer Copyright © 2015 ExcelR . All rights reserved.
Multiple Regression Model
DATA : CARS, 81 observations, “cars.csv”
• VOL = cubic feet of cab space
• HP = engine horsepower
• MPG = average miles per gallon
• SP = top speed, miles per hour
• WT = vehicle weight, hundreds of pounds
Our interest is to model the MPG of a car based on the other variables
116 Footer Copyright © 2015 ExcelR . All rights reserved.
Model and Assumptions
Our Model:
① Linearity (Assumptions about the form of the model):
◦ Linear in parameters
② Assumptions about the errors:
◦ IID Normal (Independently & identically distributed)
◦ Zero mean
◦ Constant variance (Homoscedasticity)
◦ If no constant variance (HETEROSCEDASTICITY)
◦ Independent of each other. If not independent, it is called as AUTO CORRELATION problem
③ Assumptions about the predictors:
◦ Non-random
◦ Measured without error
◦ Linearly independent of each other. If not it is called as COLLINEARITY problem
④ Assumptions about the observations:
◦ Equally reliable
Y = b0 + b1X1 + b2X2 +......+ bk Xk +e Linear
Independent
Normal
Equal Variance
117 Footer Copyright © 2015 ExcelR . All rights reserved.
Techniques used for Discrete Output
Logit Analysis
Probit Analysis
Logistic Regression
1
3
2
118 Footer Copyright © 2015 ExcelR . All rights reserved.
Regression Techniques – Simple Logistic Regression
Y = Discrete
X = Single &
Continuous
Simple
Logistic
Regression
Y = Discrete
X = Single
& Discrete
Simple
Logistic
Regression
Create
Dummy
Variable
119 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
• Logistic Regression model predicts the probability associated with each
dependent variable Category
How does it do this?
• It finds linear relationship between independent variables and a link
function of this probabilities. Then the link function that provides the
best goodness-of-fit for the given data is chosen
120 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
Multiple Logistic Regression Model is quite similar to the Multiple
Linear Regression Model, Only β coefficients vary
121 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression
122 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression Methods
123 Footer Copyright © 2015 ExcelR . All rights reserved.
Assumptions in Logistic Regression
Only one outcome per event – Like pass or fail
The outcomes are statistically independent
All relevant predictors are in the model
One category at a time – Mutually exclusive &
collectively exhaustive
Sample sizes are larger than for linear regression
1
2
3
4
5
124 Footer Copyright © 2015 ExcelR . All rights reserved.
Steps in Logistic Regression
Collect & organize sample data
Formulate Logistic Regression Model
Check the model’s validity
Determine Probabilities using Probability equation
Compile the results
1
2
3
4
5
125 Footer Copyright © 2015 ExcelR . All rights reserved.
Logistic Regression Example
Imagine that you are a Data Scientist at a very large scale integration circuit
manufacturing company. You want to know whether or not the time spent
inspecting each product impacts the quality assurance department’s ability to
detect a designing error in the circuit
→ Step-1: Collect and organize the sample data
→ Number of Observations
→ Error Identification
→ Inspection Time
Number of Observations: 55 Observations of circuits with errors, and
determine whether those errors were detected by QA

More Related Content

What's hot

Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsRupak Roy
 
Prediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate RegressionPrediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate RegressionMohitMhapuskar
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Md. Main Uddin Rony
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)Abhimanyu Dwivedi
 
Describing quantitative data with numbers
Describing quantitative data with numbersDescribing quantitative data with numbers
Describing quantitative data with numbersUlster BOCES
 

What's hot (10)

Chap08
Chap08Chap08
Chap08
 
Machine Learning Decision Tree Algorithms
Machine Learning Decision Tree AlgorithmsMachine Learning Decision Tree Algorithms
Machine Learning Decision Tree Algorithms
 
Ch08(1)
Ch08(1)Ch08(1)
Ch08(1)
 
Prediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate RegressionPrediction of Diamond Prices Using Multivariate Regression
Prediction of Diamond Prices Using Multivariate Regression
 
Ch08 ci estimation
Ch08 ci estimationCh08 ci estimation
Ch08 ci estimation
 
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
Data Analysis: Evaluation Metrics for Supervised Learning Models of Machine L...
 
Les5e ppt 06
Les5e ppt 06Les5e ppt 06
Les5e ppt 06
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
Machine learning session8(svm nlp)
Machine learning   session8(svm nlp)Machine learning   session8(svm nlp)
Machine learning session8(svm nlp)
 
Describing quantitative data with numbers
Describing quantitative data with numbersDescribing quantitative data with numbers
Describing quantitative data with numbers
 

Similar to data science training in mumbai

Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1Kumar P
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mininghktripathy
 
Standard deviation
Standard deviationStandard deviation
Standard deviationM K
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysiskompellark
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdfAmanuelDina
 
Intro to data science
Intro to data scienceIntro to data science
Intro to data scienceANURAG SINGH
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptxCalvinAdorDionisio
 
Lecture. Introduction to Statistics (Measures of Dispersion).pptx
Lecture. Introduction to Statistics (Measures of Dispersion).pptxLecture. Introduction to Statistics (Measures of Dispersion).pptx
Lecture. Introduction to Statistics (Measures of Dispersion).pptxNabeelAli89
 
Probability and Statistics part 3.pdf
Probability and Statistics part 3.pdfProbability and Statistics part 3.pdf
Probability and Statistics part 3.pdfAlmolla Raed
 
Multiple-Linear-Regression-Model-Analysis.pptx
Multiple-Linear-Regression-Model-Analysis.pptxMultiple-Linear-Regression-Model-Analysis.pptx
Multiple-Linear-Regression-Model-Analysis.pptxNaryCasila
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics Bahzad5
 
Ch3_Statistical Analysis and Random Error Estimation.pdf
Ch3_Statistical Analysis and Random Error Estimation.pdfCh3_Statistical Analysis and Random Error Estimation.pdf
Ch3_Statistical Analysis and Random Error Estimation.pdfVamshi962726
 
Estimation and confidence interval
Estimation and confidence intervalEstimation and confidence interval
Estimation and confidence intervalHomework Guru
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdfthaersyam
 

Similar to data science training in mumbai (20)

Basic statistics 1
Basic statistics  1Basic statistics  1
Basic statistics 1
 
Res701 research methodology lecture 7 8-devaprakasam
Res701 research methodology lecture 7 8-devaprakasamRes701 research methodology lecture 7 8-devaprakasam
Res701 research methodology lecture 7 8-devaprakasam
 
Lect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data MiningLect 3 background mathematics for Data Mining
Lect 3 background mathematics for Data Mining
 
Standard deviation
Standard deviationStandard deviation
Standard deviation
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 
Statistics.pdf
Statistics.pdfStatistics.pdf
Statistics.pdf
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf
 
Intro to data science
Intro to data scienceIntro to data science
Intro to data science
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptx
 
Lecture. Introduction to Statistics (Measures of Dispersion).pptx
Lecture. Introduction to Statistics (Measures of Dispersion).pptxLecture. Introduction to Statistics (Measures of Dispersion).pptx
Lecture. Introduction to Statistics (Measures of Dispersion).pptx
 
Probability and Statistics part 3.pdf
Probability and Statistics part 3.pdfProbability and Statistics part 3.pdf
Probability and Statistics part 3.pdf
 
Hm306 week 4
Hm306 week 4Hm306 week 4
Hm306 week 4
 
Hm306 week 4
Hm306 week 4Hm306 week 4
Hm306 week 4
 
Multiple-Linear-Regression-Model-Analysis.pptx
Multiple-Linear-Regression-Model-Analysis.pptxMultiple-Linear-Regression-Model-Analysis.pptx
Multiple-Linear-Regression-Model-Analysis.pptx
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 
Ch3_Statistical Analysis and Random Error Estimation.pdf
Ch3_Statistical Analysis and Random Error Estimation.pdfCh3_Statistical Analysis and Random Error Estimation.pdf
Ch3_Statistical Analysis and Random Error Estimation.pdf
 
Estimation and confidence interval
Estimation and confidence intervalEstimation and confidence interval
Estimation and confidence interval
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf
 
5.DATA SUMMERISATION.ppt
5.DATA SUMMERISATION.ppt5.DATA SUMMERISATION.ppt
5.DATA SUMMERISATION.ppt
 

More from devipatnala1

data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 
best data science certification
best data science certificationbest data science certification
best data science certificationdevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhidevipatnala1
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in bangloredevipatnala1
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in bangloredevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangaloredevipatnala1
 
data science course
data science coursedata science course
data science coursedevipatnala1
 
data science online course
data science online coursedata science online course
data science online coursedevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabaddevipatnala1
 
data science training in hyderabad
data science training in hyderabaddata science training in hyderabad
data science training in hyderabaddevipatnala1
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangaloredevipatnala1
 

More from devipatnala1 (20)

data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 
best data science certification
best data science certificationbest data science certification
best data science certification
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
data science course in delhi
data science course in delhidata science course in delhi
data science course in delhi
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in banglore
 
data science courses in banglore
data science courses in bangloredata science courses in banglore
data science courses in banglore
 
data science course
data science coursedata science course
data science course
 
data science course
data science coursedata science course
data science course
 
pmi acp training bangalore
pmi acp training bangalorepmi acp training bangalore
pmi acp training bangalore
 
data science course
data science coursedata science course
data science course
 
data science online course
data science online coursedata science online course
data science online course
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
agile training in hyderabad
agile training in hyderabadagile training in hyderabad
agile training in hyderabad
 
data science training in hyderabad
data science training in hyderabaddata science training in hyderabad
data science training in hyderabad
 
data science institute in bangalore
data science institute in bangaloredata science institute in bangalore
data science institute in bangalore
 

Recently uploaded

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 

Recently uploaded (20)

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 

data science training in mumbai

  • 1. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Data Science using R, Minitab & XLMiner R, Minitab XLMiner for Forecasting
  • 2. © 2013 - 2016 ExcelR Solutions. All Rights Reserved PMP PMI-ACP PMI-RMP CSM LSSGB Project Management Professional Agile Certified Practitioner Risk Management Professional Certified Scrum Master Lean Six Sigma Green Belt LSSBB SSMBB ITIL Lean Six Sigma Black Belt Six Sigma Master Black Belt Information Technology Infrastructure Library Agile PM Dynamic System Development Methodology Atern Name: Bharani Kumar Education: IIT Hyderabad Indian School of Business Professional certifications: My Introduction
  • 3. © 2013 - 2016 ExcelR Solutions. All Rights Reserved HSBC Driven using UK policies ITC Infotech Driven using Indian policies SME Infosys Driven using Indian policies under Large enterprises Deloitte Driven using US policies 1 2 3 4 My Introduction RESEARCH in ANALYTICS, DEEP LEARNING & IOT DATA SCIENTIST
  • 4. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Tuckman Model
  • 5. © 2013 - 2016 ExcelR Solutions. All Rights Reserved AGENDA Data Visualization using Tableau Data Mining – Supervised & Unsupervised (Machine Learning) Text Mining & NLP AGENDA
  • 6. © 2013 - 2016 ExcelR Solutions. All Rights Reserved What does it take to be a DATA SCIENTIST? Successful Data Scientist All Agenda Topics Domain Knowledge Practice Statistical Analysis Data Minin g Forecasting Data Visualizatio n
  • 7. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Welcome to the Information Age … … drowning in data and starving for Knowledge
  • 8. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 500 million tweets every day, 1.3 billion accounts YouTube users upload 100 hours of video every minute 306 items are purchased every second 26.6 Million transactions per day 100 terabytes of data uploaded daily http://www.dnaindia.com/scitech/report-facebook-saw- one-billion-simultaneous-users-on-aug-24-2119428 Processing 100 petabytes a day (1 petabyte = 1000 terabytes) More than 1 million customer transactions every hour BIG DATA! https://www.techinasia.com/alibaba-crushes-records-brings-143-billion-singles-day
  • 9. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Why Tableau?
  • 10. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Why Tableau?
  • 11. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Why Tableau?
  • 12. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Why Tableau?
  • 13. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1 2 3 4 5 Data Types – Continuous, Discrete, Nominal, Ordinal, Interval, Ratio, Random Variable, Probability, Probability Distribution First, second, third & fourth moment business decisions Graphical representation – Barplot, Histogram, Boxplot, Scatter diagram Simple Linear Regression Hypothesis Testing Agenda – Basic Statistics
  • 14. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Data Types – Continuous & Discrete
  • 15. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Data Types – Preliminaries
  • 16. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Random Variable
  • 17. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability
  • 18. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distribution
  • 19. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Applications
  • 20. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Population Sampling Frame SRS Sample Sampling Funnel
  • 21. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Central Tendency Population Sample Mean / Average Median Middle value of the data Mode Most occurring value in the data Measures of Central Tendency “Every American should have above average income, and my Administration is going to see they get it.” – American President
  • 22. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Measures of Dispersion
  • 23. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Dispersion Population Sample Variance Standard Deviation Range Max – Min Measures of Dispersion
  • 24. © 2013 - 2016 ExcelR Solutions. All Rights Reserved  For a probability distribution, the mean of the distribution is known as the expected value  The expected value intuitively refers to what one would find if they repeated the experiment an infinite number of times and took the average of all of the outcomes  Mathematically, it is calculated as the weighted average of each possible value Expected Value The formula for calculating the expected value for a discrete random variable X, denoted by μ, is: The variance of a discrete random variable X, denoted by σ2 is
  • 25. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Graphical Techniques – Bar Chart
  • 26. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Graphical Techniques – Histogram A Histogram Represents the frequency distribution, i.e., how many observations take the value within a certain interval.
  • 27. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Skewness & Kurtosis • A measure of asymmetry in the distribution • Mathematically it is given by E[(x-µ/σ)]3 • Negative skewness implies mass of the distribution is concentrated on the right Third and Fourth moments Skewness Kurtosis • A measure of the “Peakedness” of the distribution • Mathematically it is given by E[(x-µ/σ)]4 -3 • For Symmetric distributions, negative kurtosis implies wider peak and thinner tails
  • 28. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Graphical Techniques – Box Plot Box Plot : This graph shows the distribution of data by dividing the data into four groups with the same number of data points in each group. The box contains the middle 50% of the data points and each of the two whiskers contain 25% of the data points. It displays two common measures of the variability or spread in a data set Range : It is represented on a box plot by the distance between the smallest value and the largest value, including any outliers. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers Range(IQR): The middle half of a data set falls within the inter- quartile range Inter- quartile
  • 29. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Normal Distribution  The Probability associated with any single value of a random variable is always zero  Area under the entire curve is always equal to 1  The normal random variable takes values from -∞ to +∞
  • 30. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Normal Distribution Characterized by a bell shaped curve Has the following properties: 68.26% of values lie within ±1 σ from the mean 95.46% of the values lie within ±2 σ from the mean 99.73% of the values lie within ± 3σ from the mean
  • 31. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Normal Distribution Characterized by mean, µ, and standard deviation, σ X~N(µ,σ)
  • 32. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Z scores, Standard Normal Distribution • For every value (x) of the random variable X, we can calculate Z score: • Interpretation − How many standard deviations away is the value from the mean ? Z = X−µ σ
  • 33. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Calculating Probability from Z distribution Suppose GMAT scores can be reasonably modelled using a normal distribution − µ = 711 σ = 29 What is p(x ≤ 680)? Step 1: Calculate Z score corresponding to 680 - Z = (680-711)/29 = -1.06 Step 2: Calculate the probabilities using Z – Tables - P(Z ≤ -1) = 0.14
  • 34. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Calculating Probability from Z distribution • What is P( 697 ≤ X ≤ 740) ? • Step 1 : Use P(x1 ≤ X ≤ x2) = Use P( X ≤ x2) − P( X ≤ x1) • Step 2 : Calculate P( X ≤ x2) and P( X ≤ x1) as before P( X ≤ 740) = P( Z ≤ 1) = 0.84 ; P( X ≤ 697) = P( Z ≤ - 0.5) = 0.31 • Step 3 : Calculate P( 697 ≤ X ≤ 740 ) = 0.84 – 0.31 = 0.53
  • 35. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Normal Quantile (Q-Q) Plot SampleQuantiles Theoretical Quantiles
  • 36. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Sampling variation  Sample mean can be (and most likely is) different from the population mean  Sample mean varies from one sample to another  Sample mean is a random variable
  • 37. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Central Limit Theorem The standard error of the mean estimates the variability between samples whereas the standard deviation measures the variability within a single sample The Distribution of the sample mean - will be normal when the distribution of data in the population is normal - will be approximately normal even if the distribution of data in the population is not normal if the “sample size” is fairly large Mean ( X ) = µ ( the same as the population mean of the raw data) Standard Deviation (X) = , where σ is the population standard deviation and n is the sample size - This is referred to as standard error of mean _ σ √𝑛
  • 38. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Sample Size Calculation A Sample Size of 30 is considered large enough, but that may /may not be adequate More Precise conditions - n > 10( K3 )2 , where ( K3 ) is sample skewness and - n > 10( K4 ) , where ( K4) is sample kurtosis
  • 39. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Confidence Interval • What is the Probability of tomorrow’s temperature being 42 degrees ? Probability is ‘0’ • Can it be between [-50⁰C & 100⁰C] ?
  • 40. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Case Study: Confidence Interval • A University with 100,000 alumni is thinking of offering a new affinity credit card to its alumni. • Profitability of the card depends on the average balance maintained by the card holders. • A Market research campaign is launched, in which about 140 alumni accept the card in a pilot launch. • Average balance maintained by these is $1990 and the standard deviation is $2833. Assume that the population standard deviation is $2500 from previous launches. • What we can say about the average balance that will be held after a full−fledged market launch ?
  • 41. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Interval estimates of parameters • Based on sample data − The point estimate for mean balance = $1990 − Can we trust this estimate ? • What do you think will happen if we took another random sample of 140 alumni ? • Because of this uncertainty, we prefer to provide the estimate as an interval (range) and associate a level of confidence with it Interval Estimate = Point Estimate ± Margin of Error
  • 42. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Confidence Interval for the Population Mean Start by choosing a confidence level (1-α) % (e.g. 95%, 99%, 90%) Then, the population mean will be with in X ± Z1-ᾳ where Z1-ᾳ satisfies p( -Z1-ᾳ ≤ Z ≤ Z1-ᾳ) = 1-ᾳσ √𝑛 Margin of error depends on the underlying uncertainty, confidence level and sample size _ Interval Estimate = Point Estimate ± Margin of Error
  • 43. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Calculate Z value - 90%, 95% & 99%
  • 44. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Confidence Interval Calculation • Based on the survey and past data • Construct a 95% confidence interval for the mean card balance and interpret it ? • Construct a 90% confidence interval for the mean card balance and interpret it ? − n = 140; σ = $2500; X = $ 1990 − σ X _ - σ √𝑛 = = 2500 √140 = 211.29
  • 45. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Confidence Interval Interpretation Consider the 95% Confidence interval for the mean income : [$1576, $2404] Does this mean that - The mean balance of the population lies in the range ? - The mean balance is in this range 95% of the time ? - 95% of the alumni have balance in this range ? Interpretation 1 : Mean of the population has a 95% chance of being in this range for a random sample Interpretation 2 : Mean of the population will be in this range for 95% of the random samples
  • 46. © 2013 - 2016 ExcelR Solutions. All Rights Reserved What if we don’t know Sigma? • Suppose that the alumni of this university are very different and hence population standard deviation from previous launches can not be used We replace σ with our best guess (point estimate) s, which is the standard deviation of the sample: Calculate • If the underlying population is normally distributed , T is a random variable distributed according to a t-distribution with n-1 degrees of freedom Tn-1 • Research has shown that the t-distribution is fairly robust to deviation of the population of the normal model
  • 47. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Student’s t-distribution As n ꝏ tn N(0,1) i.e., as the degrees of the freedom increase, the t-distribution approaches the standard normal distribution
  • 48. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Confidence Interval for mean with unknown Sigma
  • 49. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Calculating t-value • Construct a 95% confidence interval for the mean card balance and interpret it? − n = 140; σ = $2500; X = $ 1990 − σ X _ - = 2833 √140 = 239.46 Then the 95% confidence interval for balance is [$1516, $2464] Calculate t0.95, 139 = 1.98
  • 50. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Right Decision Confidence Type II error Right Decision Power Type I error Ho is TRUE H1 is TRUE Fail to Reject Ho Reject Ho Hypothesis Testing 1-α 1-β Start with Hypothesis about a Population Parameter Collect Sample Information Reject/Do Not Reject Hypothesis The factors that affect the power of a test include sample size, effect size, population variability, and 𝛼. Power and 𝛼 are related as increasing 𝛼 decreases 𝛽. Since power is calculated by 1 minus 𝛽, if you increase 𝛼,You also increase the power of a test. The maximum power a test can have is 1, whereas the minimum value is 0.
  • 51. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Our quality will not improve after the consulting project We will acquire 8,000 new customers if I open a store in this area We will need 400 more person hours to finish this project The retail market will grow by 50% in the next 5 years Our potential customers do not spend more than 60 minutes on the web every day Less than 5% clients will default on their loans Hypothesis Testing
  • 52. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Hypothesis Testing
  • 53. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Hypothesis Testing
  • 54. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample Z test The length of 25 samples of a fabric are taken at random. Mean and standard deviation from the historic 2 years study are 150 and 4 respectively. Test if the current mean is greater than the historic mean. Assume α to be 0.05 Normality Test Stat > Basic Statistics > Graphical Summary 1 Population Standard Deviation Known or Not 1 Sample Z Test Stat > Basic Statistics > 1 Sample Z 2 3 Fabric Data
  • 55. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample Z test – Write Hypothesis
  • 56. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Y: Fabric Length is continuous X: Discrete 1 Population We are comparing mean with external standard of 150mm Data was shown to be normal Population standard deviation is known=4
  • 57. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample t Test The mean diameter of the bolt manufactured should be 10mm to be able to fit into the nut. 20 samples are taken at random from production line by a quality inspector. Conduct a test to check with 95% confidence that the mean is not different from the specification value. Normality Test Stat > Basic Statistics > Graphical Summary 1 Population Standard Deviation Known or Not 1 Sample t Test Stat > Basic Statistics > 1 Sample t 2 3 Bolt Diameter
  • 58. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample t Test – Write Hypothesis
  • 59. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Y: Bolt Diameter is continuous X: Discrete 1 Population We are comparing mean with external standard of 10mm Data was given to be Normal Population standard deviation is NOT known
  • 60. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample Sign Test The scores of 20 students for the statistics exam are provided. Test if the current median is not equal to historic median of 82. Assume ‘’ to be 0.05 Normality Test Stat > Basic Statistics > Graphical Summary 1 1 Sample Sign Test Stat > Non Parametric > 1 Sample sign 3 Student Scores
  • 61. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Sample Sign Test – Write Hypothesis
  • 62. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 2-Sample t Test A financial analyst at a Financial institute wants to evaluate a recent credit card promotion. After this promotion, 450 cardholders were randomly selected. Half received an ad promoting a full waiver of interest rate on purchases made over the next three months, and half received a standard Christmas advertisement. Did the ad promoting full interest rate waiver, increase purchases? Normality Test Stat > Basic Statistics > Graphical Summary 1 Variance Test Stat > Basic Statistics > 2 Variance 2 Sample t Test Stat > Basic Statistics > 2-Sample t 2 3 Marketing Strategy
  • 63. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 2-Sample t Test – Write Hypothesis
  • 64. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Hypothesis Testing
  • 65. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Paired T Test • This test is used to compare the means of two sets of observations when all the other external conditions are the same • This is a more powerful test as the variability in the observations is due to differences between the people or objects sampled is factored out Example: To find out if medication A lowers blood pressure
  • 66. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Trigger your thoughts! Comparing the performance of machine A vs. machine B by feeding different raw materials to each machine Compare the performance of machine A vs. machine B when the same raw material is fed to each machine Compare the power output of a wind mill when you use motor A for 1 month and motor B for 1 month Compare the power output of two wind mills next to each other simultaneously when you use motor A on one wind mill and motor B on another Identifying resistor defects and capacitor defects in same PCB by collecting such data using 20 PCB units Identifying resister defects on 20 PCB’s and capacitor defects on 20 (different) PCB’s
  • 67. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 2-Sample t test or Paired T test Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles, additive to be tested is added with the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive. What method will you choose to find the result. 2-Sample t test Paired T test
  • 68. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Mann-Whitney test Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles, additive to be tested is added with the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Normality Test Stat > Basic Statistics > Graphical Summary 1 Mann – Whitney test for Medians Stat > Non Parametric > Mann Whitney 2 Vehicle with & without Additives
  • 69. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Mann-Whitney Test – Write Hypothesis
  • 70. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Paired T test Effect of fuel additive on vehicles is being studied. Out of a total of 20 vehicles, 10 vehicles are chosen randomly and mileage is recorded. In rest of the 10 vehicles, additive to be tested is added with the fuel and their mileage is recorded. Find if the mileage increases by adding the fuel additive. Assume the same data was recorded if only 10 vehicles were chosen and mileage was recorded before and after adding the additive. Normality Test Stat > Basic Statistics > Graphical Summary 1 Paired T Test Stat > Basic Statistic > Paired T 2 Vehicle with & without Additives • Since the data was not normal, the cause of non-normality was investigated and it was found that the first data point for “with additive” was wrongly entered. This value should have been 20. Now, proceed with the rest of the analysis. • If the data were truly non-normal our analysis would stop here.
  • 71. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Paired T test – Write Hypothesis
  • 72. © 2013 - 2016 ExcelR Solutions. All Rights Reserved One-Way ANOVA A marketing organization outsources their back-office operations to three different suppliers. The contracts are up for renewal and the CMO wants to determine whether they should renew contracts with all suppliers or any specific supplier. CMO want to renew the contract of supplier with the least transaction time. CMO will renew all contracts if the performance of all suppliers is similar Normality Test Stat > Basic Statistics > Graphical Summary 1 Variance Test Stat > ANOVA > Test for Equal Variances ANOVA Stat > ANOVA > One-Way…. 2 3 Contract Renewal
  • 73. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Example : More weight reduction programs • She randomly assigns equal number of participants to each of these programs from a common pool of volunteers • Suppose the nutrition expert would like to do a comparative evaluation of three diet programs(Atkins, South Beach, GM) • Suppose the average weight losses in each of the groups(arms) of the experiments are 4.5kg, 7kg, 5.3kg • What can she conclude?
  • 74. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Two kinds of variation matter • Not every individual in each program will respond identically to the diet program • Easier to identify variations across programs if variations within programs are smaller • Hence the method is called Analysis of Variance(ANOVA) • With-in group variation = Experimental Error • Between group variation
  • 75. © 2013 - 2016 ExcelR Solutions. All Rights Reserved • It should be obvious that for every observation : Totij = ti + eij • What is more surprising and useful is: Formalizing the intuition behind variations
  • 76. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Statistically test for equality means • n subjects equally divided into r groups • Hypothesis - H0: μ1 = μ2 = μ3 = … = μr - Not all μi are equal • Calculate - Mean Square Treatment MSTR = SSTR / (r‐1) - Mean Square Error MSE = SSE / (n‐r) - The ratio of two squares f = MSTR/MSE = Between group variation/Within group variation - Strength of this evidence p‐value = Pr(F(r‐1,n‐r) ≥ f) • Reject the null hypothesis if p‐value < α
  • 77. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Analysis of variance(ANOVA) • ANOVA can be used to test equality of means when there are more then 2 populations • ANOVA can be used with one or two factors • If only one factor is varying, then we would use a one-way ANOVA – Example: We are interested in comparing the mean performance of several departments within a company. Here the only factor is the name of department – If there are two factors, we would use a two way ANOVA. Example: One factor is department and the second factor is the shift.(day vs. Night)
  • 78. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Analysis of variance(ANOVA) Source of Variation Sum of Squares (SS) Degrees of Freedom Mean Square (MS) F Test Statistic Between Treatments SSFactor K-1 MSFactor = SSFactor / DFFactor F = MSFactor / MSError Within Treatment SSError N-k MSError = SSError / DFError Total SSTotal N-1 Source of Variation Sum of Squares (SS) Degrees of Freedom Mean Square (MS) F Test Statistic Factor A SSA nA - 1 MSA = SSA / (nA – 1) FA = MSA / MSE Factor B SSB nB - 1 MSB = SSB / (nB – 1) FB = MSB / MSE Interaction A * B SSAB (nA – 1) (nB – 1) MSAB = SSAB / (nAB – 1) FAB = MSAB / MSE Error SSE n – nA * nB MSE = SSE / (n – nA * nB) Total SST n - 1 One Way ANOVA Two Way ANOVA
  • 79. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Is the Transaction time dependent on whether person A or B processes thetransaction? Is medicine 1 effective or medicine 2 at reducing heart stroke? Is the new branding program more effective in increasing profits? Does the productivity of employees vary depending on the three levels? (Beginner, Intermediate and Advanced) Three different sale closing methods were used. Which one is most effective? Four types of machines are used. Is weight of the Rugby ball dependent on the type ofmachine used? 2 Sample t-test ANOVA – One Way Dichotomies
  • 80. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Non-Parametric equivalent to ANOVA • When the data are not normal or if the data points are very few to figure out if the data are normal and we have more than 2 populations, we can use the Mood’s Median or Kruskal Wallis test to compare the populations Ho : All the medians are the same Ha : One of the medians is different • Mood’s median assigns the data from each population that is higher than the overall median to one group, and all points that are equal or lower to another group. It then uses a Chi-Square test to check if the observed frequencies are close to expected frequencies • Kruskal Wallis is another test that is non-parametric equivalent of ANOVA. Kruskal Wallis is the extension of Mann-Whitney test
  • 81. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Mood’s Median & Kruskal Wallis Growth is measured for three treatments as shown in the case study. Compare the effect of the three treatments on growth. Mood’s Median – handles outliers well Stat > Nonparametric > Mood’s Median 1 Kruskal Wallis – more powerful than Mood’s Median Stat > Nonparametric > Kruskal Wallis 2 Height Growth
  • 82. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Hypothesis Testing
  • 83. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 1-Proportion Test • A poll is carried out to find the acceptability of new football coach by the people. It was decided that if the support rate for the coach for the entire population was truly less then 25%, the coach would be fired • 2000 people participated and 482 people supported the new coach • Conduct a test to check if the new coach should be fired with 95% level of confidence Football Coach 1-Proportion Test Stat > Basic Statistics > 1-Proportion
  • 84. © 2013 - 2016 ExcelR Solutions. All Rights Reserved 2-Proportion Test Johnnie Talkers soft drinks division sales manager has been planning to launch a new sales incentive program for their sales executives. The sales executives felt that adults (>40 yrs) won’t buy, children will & hence requested sales manager not to launch the program. Analyze the data & determine whether there is evidence at 5% significance level to support the hypothesis Johnnie Talkers Proportion A = Proportion B Check p-valueHo Proportion A NOT = Proportion B If p-value < alpha, we reject Ho Ha
  • 85. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Test How can you determine whether the distribution of defects in your product or service has changed from the historic distribution over time, or exceeds an industry standard • Do you think mean is more significant or variance? Comparing population’s variance to a standard value involves calculating the chi-square test statistic We can also: Determine whether one variable is dependent over another Comparing observed & expected frequencies where variance is unknown. This is called as goodness-of-fit test Compare multiple proportions
  • 86. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Goodness-of-fit test Goodness-of-fit test is to test assumptions about the distributions that fit the process data Are observed frequencies (O) same or different from historical, expected or theoretical frequencies (E)? If there’s a difference between them, this suggests that the distribution model expressed by the expected frequencies does not fit the data
  • 87. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Test • A city has a newly opened nuclear plant, and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. To make a strong case, he wants to prove with numbers that an exposure to radiation levels is leading to an increase in diseased population. He formulates a contingency table of exposure and disease. • Does the data suggest an association between the disease and exposure? Disease Total Exposure Yes No Yes 37 13 50 No 17 53 70 Total 54 66 120
  • 88. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Test Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes and no) if the probabilities were the same If there were no effect of exposure, the probabilities should be same and the chi-squared statistic would have a very low value. Proportion of population exposed = (50/120) = 0.42 Proportion of population not exposed = (70/120) = 0.58 Thus, expected values: Population with disease = 54 Exposure Yes : 54 * 0.42 = 22.5 Exposure No : 54 * 0.58 = 31.5 Population without disease = 66 Exposure Yes : 66 * 0.42 = 27.5 Exposure No : 66 * 0.58 = 38.5
  • 89. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Test • Calculate the Chi-squared statistic χ2 = Σ = = 29.1 • Calculate the degrees of freedom : (Number of rows – 1) X (Number of columns – 1) df = (2 – 1) X (2 – 1) = 1 • Calculate the p-value from the Chi-squared table For chi-squared value 29.1 and degrees of freedom = 1, from the table, p-value is < 0.001 • Interpretation: There is 0.001 chance of obtaining such discrepancies between expected and observed values if there is no association • Conclusion : There is an association between the exposure and disease
  • 90. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Chi-Square Test Bahamantech Research Company uses 4 regional centers in South Asia (India, China, Srilanka and Bangladesh) to input data of questionnaire responses. They audit a certain % of the questionnaire responses versus data entry. Any error in data entry renders it defective. The chief data scientist wants to check whether the defective % varies by country. Analyze the data at 5% significance level and help the manager draw appropriate inferences. [‘1’ means not defectives & ‘0’ means defective] All proportions are equal Check p-valueHo Not all proportions are equal If p-value < alpha, we reject Ho Ha Bahaman Research
  • 91. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Non-Parametric Tests •Referred to as “distribution free”, as they don’t involve making assumptions of any data •They have lower power than the parametric tests and hence are always given the second preference after the parametric tests •These tests are typically focused on median rather than mean •They involve straight-forward procedures like counting and ordering
  • 92. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Thank You
  • 93. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Lognormal: • Fits many kinds of failure data • Used for reliability analysis, cycles-to-failure, loading variables & fatigue stress • Tensile strength of fibers & breaking strength of concrete • Environment data such as random quantities of pollutants in water or air • Economic variables such as per capita income • Extreme values are well managed & makes data normal • μ, σ are mean & standard deviation of natural logarithms Data Log transformed 12 2.48 28 3.33 87 4.47 143 4.96
  • 94. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Lognormal: • This distribution is right skewed • Skewness increases as value of σ increases • Pdf starts at zero, increases to its mode, and then decreases • If time-to-failure has a lognormal distribution, then the logarithm of time-to-failure has a normal distirbution
  • 95. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Exponential: • Length of time between check-ins at a reception desk, calls at a call center, customers at a cashier • Used when events occur continuously & independently at a constant average rate • Used to model rate of change that will occur in a given amount of time • How long equipment will keep working with proper maintenance & part replacement • Use to model behavior of independent variables that have a constant rate • The occurrences of variables are described by a Poisson distribution, but the times between occurrences are described by Exponential distribution • If X is Poisson distributed then Y = 1/X will be exponentially distributed • # of arrivals at a checkout counter, # of product failures over time – Poisson • Length of time between events, i.e., one arrival or failure & the next – Exponential distribution • Exponential distribution can model the interval between random events • λ = failure rate; θ = mean; x = random variable • Used to model mean time between occurrences • In exponential population, 37% of observations are below the mean & 63% are above • Uses constant failure rate
  • 96. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Weibull: • Model failure rate; rate is not constant • Model time to failure, time to repair & material strength • When system/item ages & failure rate increases/decreases • Can model different distributions due to having parameters of shape, scale & location • Can simulate Lognormal, Exponential & many other distributions • Use widely in reliability & statistical applications • Weibull & Lognormal are from same family & both can be used to assess the dataset that contains close to average values (not too high / low) • However, Weibull is a better fit when majority of data falls to the higher side • Lognormal is a better fit when majority of data falls to the lower side
  • 97. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Weibull: • β is shape parameter, also called as slope, determines the shape of the distribution  When beta = 1, shape of distribution = exponential distribution  When beta: 3 to 4, shape of distribution = normal distribution  Several beta values can approximate lognormal distribution • η is scaled parameter (eta), determines the spread or width of distribution • γ is non-zero location parameter, is the point, below which there are no failures, changing the value will move distribution to right or left  Gamma > 0, there is a period when no failures occur  Gamma < 0, failures have occurred before time equals zero e.g., defective raw materials or failure during transportation  When Gamma = 0, eta is called as characteristic life • Regardless of specific value of beta, 63.2% of values fall below the characteristic life
  • 98. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Bivariate Normal Distribution: • Used when 2 variables that are normally distributed & may be totally independent or may be correlated to some degree • A joint distribution of two independent variables that simultaneously & jointly cross-classifies the data • Can be discrete or continuous • 3D plot like mountain terrain • X & Y axes represent independent variables • Z axis shows either  frequency for discrete data  probability for continuous data • The maximum or peak occurs when X1 = Mu1 & X2 = Mu2. You can take a “slice” anywhere along the distribution by fixing one of the variables. This is known as a conditional distribution
  • 99. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Probability Distributions Bivariate Normal Distribution: • Can help determine items of critical importance: • Causality – examine the joint frequencies to investigate if the second variable changes in a systematic way when the first variable changes • Predictions – reviewing outcomes from one variable as the other changes • Importance – if two variables are causally related they should have a statistically significant impact
  • 100. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Scatter Diagram  Scatter diagrams or plots provides a graphical representation of the relationship of two continuous variables  Be Careful - Correlation does not guarantee causation. Correlation by itself does not imply a cause and effect relationship!  Judge strength of relationship by width or tightness of scatter  Determine direction of the relationship, e.g. If X increases, and Y decreases, it is negative correlation, similarly if X increases, and Y increases, it is positive correlation
  • 101. © 2013 - 2016 ExcelR Solutions. All Rights Reserved
  • 102. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Correlation Analysis  Correlation Analysis measures the degree of linear relationship between two variables  Range of correlation coefficient -1 to +1  Perfect positive relationship +1  Perfect negative relationship -1  No Linear relationship 0  If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a good relationship • Example: r = 0.87, r = -0.9, r = 0.9, r = -0.87 describe good relationship • Example: r = 0.5, r = -0.5, r = 0.28 describe poor relationship  Correlation values of -1 or 1 imply an exact linear relationship. However, the real value of correlation is in quantifying less than perfect relationships  We can perform regression analysis, which attempts to further describe this type of relationship, if the correlation is good between the 2 variables
  • 103. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Correlation Analysis
  • 104. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Linear Regression Model The equation that represents how an independent variable is related to a dependent variable and an error term is a regression model y = β0 + β1x + ε Where, β0 and β1 are called parameters of the model, ε is a random variable called error term. β0 β1
  • 105. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Linear Regression Model y intercept Error term An observed value of x when x equals x0 Mean value of y when x equals x0 Straight line defined by the equation y = β0 + β1x X Y x0 = A specific value of x, the independent variable. β0 β1 Fitting a straight line by least squares ˆY = ˆb0 + ˆb1X
  • 106. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Regression Analysis  R-squared-also known as Coefficient of determination, represents the % variation in output (dependent variable) explained by input variables/s or Percentage of response variable variation that is explained by its relationship with one or more predictor variables  Higher the R^2, the better the model fits your data  R^2 is always between 0 and 100%  R squared is between 0.65 and 0.8 => Moderate correlation  R squared in greater than 0.8 => Strong correlation
  • 107. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Regression Analysis  Prediction and Confidence Interval are types of confidence intervals used for predictions in regression and other linear models  Prediction Interval: Represents a range that a single new observation is likely to fall given specified settings of the predictors  Confidence interval of the prediction: Represents a range that the mean response is likely to fall given specified settings of the predictors  The prediction interval is always wider than the corresponding confidence interval because of the added uncertainty involved in predicting a single response versus the mean response
  • 108. © 2013 - 2016 ExcelR Solutions. All Rights Reserved Regression Techniques – Simple Linear Regression Y = Continuous X = Single & Continuous Simple Linear Regression Y = Continuous X = Single & Discrete Simple Linear Regression Create Dummy Variable
  • 109. 109 Footer Copyright © 2015 ExcelR . All rights reserved. Simple Linear Regression – Dummy Variable Gender Dummy Variable Male 1 Female 0 Male 1 Female 0 Male 1 Male 1 Female 0 Male 1 Male 1 Female 0
  • 110. 110 Footer Copyright © 2015 ExcelR . All rights reserved. Simple Linear Regression – R A business problem: The Waist Circumference – Adipose Tissue data • Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of cardio-vascular diseases • Computed Tomography, commonly called the CT Scan is the only technique that allows for the precise and reliable measurement of the AT (at any site in the body) • The problems with using the CT scan are: • Many physicians do not have access to this technology • Irradiation of the patient (suppresses the immune system) • Expensive • Is there a simpler yet reasonably accurate way to predict the AT area? i.e., • Easily available • Risk free • Inexpensive • A group of researchers conducted a study with the aim of predicting abdominal AT area using simple anthropometric measurements, i.e., measurements on the human body • The Waist Circumference – Adipose Tissue data is a part of this study wherein the aim is to study how well waist circumference (WC) predicts the AT area
  • 111. 111 Footer Copyright © 2015 ExcelR . All rights reserved. Simple Linear Regression – Data Set Observation Waist AT Observation Waist AT Observation Waist AT 1 74.75 25.72 38 103 129 75 108 217 2 72.6 25.89 39 80 74.02 76 100 140 3 81.8 42.6 40 79 55.48 77 103 109 4 83.95 42.8 41 83.5 73.13 78 104 127 5 74.65 29.84 42 76 50.5 79 106 112 6 71.85 21.68 43 80.5 50.88 80 109 192 7 80.9 29.08 44 86.5 140 81 103.5 132 8 83.4 32.98 45 83 96.54 82 110 126 9 63.5 11.44 46 107.1 118 83 110 153 10 73.2 32.22 47 94.3 107 84 112 158 11 71.9 28.32 48 94.5 123 85 108.5 183 12 75 43.86 49 79.7 65.92 86 104 184 13 73.1 38.21 50 79.3 81.29 87 111 121 14 79 42.48 51 89.8 111 88 108.5 159 15 77 30.96 52 83.8 90.73 89 121 245 16 68.85 55.78 53 85.2 133 90 109 137 17 75.95 43.78 54 75.5 41.9 91 97.5 165 18 74.15 33.41 55 78.4 41.71 92 105.5 152 19 73.8 43.35 56 78.6 58.16 93 98 181 20 75.9 29.31 57 87.8 88.85 94 94.5 80.95 21 76.85 36.6 58 86.3 155 95 97 137 22 80.9 40.25 59 85.5 70.77 96 105 125 23 79.9 35.43 60 83.7 75.08 97 106 241 24 89.2 60.09 61 77.6 57.05 98 99 134 25 82 45.84 62 84.9 99.73 99 91 150 26 92 70.4 63 79.8 27.96 100 102.5 198 27 86.6 83.45 64 108.3 123 101 106 151 28 80.5 84.3 65 119.6 90.41 102 109.1 229 29 86 78.89 66 119.9 106 103 115 253 30 82.5 64.75 67 96.5 144 104 101 188 31 83.5 72.56 68 105.5 121 105 100.1 124 32 88.1 89.31 69 105 97.13 106 93.3 62.2 33 90.8 78.94 70 107 166 107 101.8 133 34 89.4 83.55 71 107 87.99 108 107.9 208 35 102 127 72 101 154 109 108.5 208 36 94.5 121 73 97 100
  • 112. 112 Footer Copyright © 2015 ExcelR . All rights reserved. Simple Linear Regression – Transformation reg <- lm(AT ~ Waist) # Linear Regression summary(reg) confint(reg, level=0.95) predict(reg, interval="predict”) reg_log <- lm(AT ~ log(Waist)) # Regression using Logarithmic Transformation summary(reg_log) confint(reg_log, level=0.95) predict(reg, interval="predict”) reg_exp <- lm(log(AT) ~ Waist) # Regression using Exponential Transformation summary(reg_exp) confint(reg_exp, level = 0.95) predict(reg, interval="predict”)
  • 113. 113 Footer Copyright © 2015 ExcelR . All rights reserved. Regression Techniques – Multiple Linear Regression Y = Continuous X = Multiple & Continuous Multiple Linear Regression Y = Continuous X = Multiple & Discrete Multiple Linear Regression Create Dummy Variable
  • 114. 114 Footer Copyright © 2015 ExcelR . All rights reserved. Multiple Linear Regression – Dummy Variable Make of car Dummy Variable_Petrol Dummy Variable_Diesel Dummy Variable_CNG Dummy Variable_LPG Petrol 1 0 0 0 Diesel 0 1 0 0 CNG 0 0 1 0 LPG 0 0 0 1 Diesel 0 1 0 0 CNG 0 0 1 0 Petrol 1 0 0 0 LPG 0 0 0 1 Petrol 1 0 0 0 LPG 0 0 0 1
  • 115. 115 Footer Copyright © 2015 ExcelR . All rights reserved. Multiple Regression Model DATA : CARS, 81 observations, “cars.csv” • VOL = cubic feet of cab space • HP = engine horsepower • MPG = average miles per gallon • SP = top speed, miles per hour • WT = vehicle weight, hundreds of pounds Our interest is to model the MPG of a car based on the other variables
  • 116. 116 Footer Copyright © 2015 ExcelR . All rights reserved. Model and Assumptions Our Model: ① Linearity (Assumptions about the form of the model): ◦ Linear in parameters ② Assumptions about the errors: ◦ IID Normal (Independently & identically distributed) ◦ Zero mean ◦ Constant variance (Homoscedasticity) ◦ If no constant variance (HETEROSCEDASTICITY) ◦ Independent of each other. If not independent, it is called as AUTO CORRELATION problem ③ Assumptions about the predictors: ◦ Non-random ◦ Measured without error ◦ Linearly independent of each other. If not it is called as COLLINEARITY problem ④ Assumptions about the observations: ◦ Equally reliable Y = b0 + b1X1 + b2X2 +......+ bk Xk +e Linear Independent Normal Equal Variance
  • 117. 117 Footer Copyright © 2015 ExcelR . All rights reserved. Techniques used for Discrete Output Logit Analysis Probit Analysis Logistic Regression 1 3 2
  • 118. 118 Footer Copyright © 2015 ExcelR . All rights reserved. Regression Techniques – Simple Logistic Regression Y = Discrete X = Single & Continuous Simple Logistic Regression Y = Discrete X = Single & Discrete Simple Logistic Regression Create Dummy Variable
  • 119. 119 Footer Copyright © 2015 ExcelR . All rights reserved. Logistic Regression • Logistic Regression model predicts the probability associated with each dependent variable Category How does it do this? • It finds linear relationship between independent variables and a link function of this probabilities. Then the link function that provides the best goodness-of-fit for the given data is chosen
  • 120. 120 Footer Copyright © 2015 ExcelR . All rights reserved. Logistic Regression Multiple Logistic Regression Model is quite similar to the Multiple Linear Regression Model, Only β coefficients vary
  • 121. 121 Footer Copyright © 2015 ExcelR . All rights reserved. Logistic Regression
  • 122. 122 Footer Copyright © 2015 ExcelR . All rights reserved. Logistic Regression Methods
  • 123. 123 Footer Copyright © 2015 ExcelR . All rights reserved. Assumptions in Logistic Regression Only one outcome per event – Like pass or fail The outcomes are statistically independent All relevant predictors are in the model One category at a time – Mutually exclusive & collectively exhaustive Sample sizes are larger than for linear regression 1 2 3 4 5
  • 124. 124 Footer Copyright © 2015 ExcelR . All rights reserved. Steps in Logistic Regression Collect & organize sample data Formulate Logistic Regression Model Check the model’s validity Determine Probabilities using Probability equation Compile the results 1 2 3 4 5
  • 125. 125 Footer Copyright © 2015 ExcelR . All rights reserved. Logistic Regression Example Imagine that you are a Data Scientist at a very large scale integration circuit manufacturing company. You want to know whether or not the time spent inspecting each product impacts the quality assurance department’s ability to detect a designing error in the circuit → Step-1: Collect and organize the sample data → Number of Observations → Error Identification → Inspection Time Number of Observations: 55 Observations of circuits with errors, and determine whether those errors were detected by QA