This document discusses methods for analyzing relationships between categorical variables using two-way tables. It explains how to calculate marginal distributions by finding row totals, column totals, and grand totals from two-way tables. Conditional distributions are also discussed, which show the proportion of each category within each row or column total. An example two-way table is provided analyzing the relationship between age group and gender using percentages to describe the marginal and conditional distributions.
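The marginal and conditional calculations described above can be sketched in a few lines. The table below is a made-up example (the counts are assumptions for illustration, not the table from the original document):

```python
# Hypothetical two-way table: counts of respondents by age group (rows)
# and gender (columns). The numbers are made up for illustration.
table = {
    "under 30": {"female": 30, "male": 20},
    "30 and over": {"female": 25, "male": 25},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of age group: row total / grand total
row_totals = {age: sum(row.values()) for age, row in table.items()}
marginal_age = {age: t / grand_total for age, t in row_totals.items()}

# Conditional distribution of gender within each age group:
# cell count / row total
conditional = {
    age: {g: n / row_totals[age] for g, n in row.items()}
    for age, row in table.items()
}

print(marginal_age)             # each age group's share of the grand total
print(conditional["under 30"])  # gender proportions within one row
```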
The document discusses radicals and their properties. It defines radicals as irrational numbers expressed in roots, and as another way to write numbers with fractional exponents. A radical is written ⁿ√a, where n is the index and a is the radicand. The document asks what affects the quadratic term and constant term of a quadratic function, and relates the coefficient of the quadratic term to the width of the parabola.
The document defines and provides examples of various types of functions, including:
- Linear functions, which can be represented by a straight line and have the equation y = mx + b.
- Quadratic functions, whose graphs are parabolas with the equation y = ax² + bx + c.
- Inversely proportional functions with the equation y = k/x.
- Radical functions containing a variable inside a root, like y = √x.
- Exponential functions where the variable is an exponent, like y = aˣ.
It also discusses how to graph and analyze the key features of each type of function.
This document discusses simplifying radical expressions using the product, quotient, and power rules for radicals. It also covers adding, subtracting, multiplying, and dividing radicals. Rationalizing denominators is explained as well as solving radical equations. Key steps include isolating the radical term, squaring both sides to remove the radical, and checking solutions in the original equation.
Linearization involves developing a linear approximation of a nonlinear system around an operating point. This allows tools from linear systems theory to be applied to analyze and design controllers for nonlinear systems. Specifically, Taylor's theorem is used to expand the nonlinear functions as a linear combination of deviations from the operating point. The resulting linearized model is only valid locally but provides an approximate way to analyze system behavior if well-controlled near the operating point. Examples show how to derive linearized models for common nonlinear systems like tanks and chemical reactors.
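As a minimal sketch of the idea (model and numbers are assumptions for illustration, not taken from the original examples), consider the outflow term c·√h of a gravity-drained tank, linearized around an operating level h0 using the first-order Taylor expansion f(h) ≈ f(h0) + f′(h0)(h − h0):

```python
import math

# Linearize the outflow term c*sqrt(h) around operating level h0.
# Parameter values are illustrative assumptions.
c, h0 = 2.0, 4.0

def outflow(h):
    return c * math.sqrt(h)

def outflow_lin(h):
    # f(h) ~ f(h0) + f'(h0)*(h - h0), with f'(h) = c / (2*sqrt(h))
    return outflow(h0) + c / (2 * math.sqrt(h0)) * (h - h0)

# Near the operating point the approximation is close...
print(abs(outflow(4.1) - outflow_lin(4.1)))
# ...but degrades far from it, matching the "only valid locally" caveat.
print(abs(outflow(9.0) - outflow_lin(9.0)))
```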
The document provides instructions for students on how to simplify radicals. It defines what it means to simplify a radical as having no perfect square factors in the radicand. The objective is for students to be able to simplify at least 9 out of 10 radicals correctly. Students are taught to rewrite the radicand using prime factorization, replace any perfect square factors with their numeric equivalent, and check if the radicand is simplified by having no remaining perfect square factors. Examples are worked through and students complete guided practice problems to demonstrate their understanding.
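The procedure the lesson teaches — factor the radicand into primes, then pull each perfect-square factor outside the root — can be sketched as follows (a simple trial-division version, not the lesson's own worksheet code):

```python
# Simplify sqrt(n) by stripping perfect-square factors from the radicand.
def simplify_sqrt(n):
    """Return (outside, inside) such that sqrt(n) = outside * sqrt(inside)."""
    outside, inside = 1, n
    d = 2
    while d * d <= inside:
        # Pull out each repeated prime pair d*d from the radicand.
        while inside % (d * d) == 0:
            inside //= d * d
            outside *= d
        d += 1
    return outside, inside

print(simplify_sqrt(72))  # (6, 2): sqrt(72) = 6*sqrt(2)
print(simplify_sqrt(45))  # (3, 5): sqrt(45) = 3*sqrt(5)
```

When `inside` has no remaining perfect-square factors, the radical is simplified in exactly the sense the objective describes.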
Equilibrium point analysis linearization technique, by Tarun Gehlot
The document discusses the linearization technique for analyzing the behavior of solutions near equilibrium points of nonlinear systems of differential equations. It explains that nonlinear systems can be approximated by linearizing around equilibrium points using a Jacobian matrix. The eigenvalues of the Jacobian matrix then allow classifying the equilibrium point and predicting whether solutions will converge or diverge from it. This technique is demonstrated on examples, including the Van der Pol oscillator and pendulum equations.
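A sketch of the classification step on the pendulum example, under the common model x1′ = x2, x2′ = −(g/l)·sin(x1) − b·x2 (parameter values here are illustrative assumptions, not from the original). The Jacobian is [[0, 1], [−(g/l)·cos(x1), −b]], and its eigenvalues at each equilibrium predict convergence or divergence:

```python
import math

g_over_l, b = 9.81, 0.5  # illustrative parameter values

def eig2(a11, a12, a21, a22):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial."""
    tr, det = a11 + a22, a11 * a22 - a12 * a21
    disc = tr * tr - 4 * det
    if disc >= 0:
        r = math.sqrt(disc)
        return complex((tr + r) / 2), complex((tr - r) / 2)
    r = math.sqrt(-disc)
    return complex(tr / 2, r / 2), complex(tr / 2, -r / 2)

# Hanging equilibrium (x1 = 0): complex pair with negative real parts,
# so nearby solutions spiral in (converge).
lam_down = eig2(0.0, 1.0, -g_over_l * math.cos(0.0), -b)
print([round(l.real, 3) for l in lam_down])

# Inverted equilibrium (x1 = pi): one positive real eigenvalue, a saddle,
# so solutions diverge.
lam_up = eig2(0.0, 1.0, -g_over_l * math.cos(math.pi), -b)
print(round(max(l.real for l in lam_up), 3))
```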
This chapter discusses examining relationships between two quantitative variables using scatterplots and correlation. Scatterplots show the relationship between an explanatory and response variable by plotting each data point. Correlation measures the strength and direction of the linear relationship between two variables on a scale of -1 to 1. The least squares regression line fits a straight line to minimize the residuals, or vertical distances between the data points and the line. It is used to model and make predictions about the relationship between two variables.
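The least squares line described above follows from the standard formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A small sketch with made-up data points:

```python
# Fit a least squares regression line y = a + b*x. Data is illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

print(round(b, 3), round(a, 3))   # slope and intercept
# Using the fitted line to make a prediction, as the chapter describes:
print(round(a + b * 6.0, 2))
```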
Lagrange's theorem states that for any finite group G and subgroup H of G, the order of H divides the order of G. The document provides the proof of Lagrange's theorem and several examples. It also discusses corollaries, including that every group of prime order is cyclic, every group of order less than 6 is abelian, and the order of an element must divide the group order. However, the converse of Lagrange's theorem is false - there can exist groups where not every divisor of the group order is a possible subgroup order.
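The corollary that an element's order divides the group order can be checked empirically on a small example. Here is a sketch for Z₁₂ under addition mod 12 (the choice of group is an assumption for illustration):

```python
# Check that the order of every element of Z_12 divides |Z_12| = 12.
n = 12

def element_order(g):
    """Smallest k with k*g = 0 (mod n)."""
    k, x = 1, g % n
    while x != 0:
        x = (x + g) % n
        k += 1
    return k

orders = {g: element_order(g) for g in range(n)}
print(orders)
assert all(n % k == 0 for k in orders.values())  # Lagrange corollary holds
```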
This document provides information on radicals and root expressions. It defines the key parts of a radical expression, including the index and radicand. It explains that roots and radicals are inverse operations of exponents, and how to "undo" a power or radical. Examples are given of various nth roots and their relationships to exponents. The document outlines the rules for principal roots, noting even roots have two solutions while odd roots only have one. It cautions the reader to check the index when evaluating roots. Finally, it discusses how to simplify nth roots involving variables by dividing the exponent by the index.
Radical equations contain radicals or rational exponents. To solve them, isolate the radical term and raise both sides of the equation to the same power to cancel out the radical. This may result in extraneous solutions that do not satisfy the original equation, so all solutions must be checked. Solving equations with rational exponents involves isolating the power and raising both sides to the reciprocal power.
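A concrete instance of the check-your-solutions step: squaring √(x+3) = x − 3 gives x² − 7x + 6 = 0, with candidate roots x = 1 and x = 6 (this example equation is an illustration, not from the original). Substituting back into the original equation exposes the extraneous root:

```python
import math

# Candidate roots of x^2 - 7x + 6 = 0, obtained by squaring the
# radical equation sqrt(x + 3) = x - 3.
candidates = [1.0, 6.0]

# Only solutions that satisfy the ORIGINAL equation are kept.
valid = [x for x in candidates
         if math.isclose(math.sqrt(x + 3), x - 3)]
print(valid)  # x = 1 is extraneous: sqrt(4) = 2, but 1 - 3 = -2
```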
This document provides an introduction to regression and correlation analysis. It discusses simple and multiple linear regression models, how to interpret regression coefficients, and how to check the assumptions and adequacy of regression models. Key aspects covered include computing the regression line using the least squares method, interpreting the slope and intercept, checking the normality of residuals, and examining residual plots to validate the model. The goal of regression analysis is to model the relationship between a dependent variable and one or more independent variables.
This document provides an introduction to calculus and functions. It discusses that calculus originated in the 17th century through the work of Newton and Leibniz. A function is defined as a relationship between variables where each value of one variable corresponds to a unique value of the other. Functions can relate multiple variables as well. Examples of functions include the area of a circle as a function of its radius and the perimeter of a rectangle as a function of its length. The document provides exercises on evaluating functions and relating composite functions.
ME-314 Introduction to Control Engineering is a course taught to Mechanical Engineering senior undergrads. The course is taught by Dr. Bilal Siddiqui at DHA Suffa University. This lecture is about basic rules of sketching root locus.
Numerical solution of eigenvalues and applications 2, by SamsonAjibola
This document provides an overview of eigenvalues and their applications. It discusses:
1) Eigenvalues arise in applications across science and engineering, including mechanics, control theory, and quantum mechanics. Numerical methods are used to solve increasingly large eigenvalue problems.
2) Common methods for small problems include the QR and power methods. For large, sparse problems, techniques like the Krylov subspace and Arnoldi methods are used to compute a few desired eigenvalues/eigenvectors.
3) The document outlines the structure of the thesis, which will investigate methods for finding eigenvalues like Krylov subspace, power, and QR. It will also explore applications in areas like biology, statistics, and engineering.
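Of the methods named above, the power method is simple enough to sketch in a few lines. This is a minimal pure-Python version on a made-up symmetric 2×2 matrix (eigenvalues 3 and 1), not code from the thesis:

```python
# Power iteration: repeatedly apply A and normalize; the iterate aligns
# with the dominant eigenvector and the norm converges to |lambda_max|.
A = [[2.0, 1.0], [1.0, 2.0]]  # illustrative matrix, eigenvalues 3 and 1

def power_method(A, iters=100):
    v = [1.0, 0.0]
    lam = 0.0
    for _ in range(iters):
        w = [A[0][0] * v[0] + A[0][1] * v[1],
             A[1][0] * v[0] + A[1][1] * v[1]]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
        lam = norm
    return lam, v

lam, v = power_method(A)
print(round(lam, 6))  # the dominant eigenvalue
```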
This document discusses rational exponents and nth roots. It explains that an nth root of a number a is a number whose nth power is equal to a. It also notes that if the index n is even, then the radicand a must be non-negative. The document provides rules for simplifying expressions with rational exponents, such as eliminating the root and then the power. It also discusses strategies for solving equations with exponents and radicals.
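The equivalence between rational exponents and roots — a^(m/n) is the nth root of a, raised to the mth power — can be verified numerically (the specific numbers are an illustration):

```python
# a**(m/n): take the nth root, then the mth power, matching the
# "eliminate the root and then the power" strategy in the summary.
a, m, n = 8.0, 2, 3
via_root = (a ** (1.0 / n)) ** m   # cube root of 8, then squared
direct = a ** (m / n)              # same value computed directly
print(round(via_root, 6), round(direct, 6))
```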
This presentation covered the following topics:
1. Definition of Correlation and Regression
2. Meaning of Correlation and Regression
3. Types of Correlation and Regression
4. Karl Pearson's methods of correlation
5. Bivariate Grouped data method
6. Spearman's Rank correlation Method
7. Scatter diagram method
8. Interpretation of correlation coefficient
9. Lines of Regression
10. Regression Equations
11. Difference between correlation and regression
12. Related examples
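Karl Pearson's method, listed above, computes r = Sxy / √(Sxx·Syy) from the deviations of each variable about its mean. A sketch on made-up paired data:

```python
import math

# Karl Pearson's coefficient of correlation from its definitional formula.
# Paired observations are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # between -1 and 1; here a fairly strong positive value
```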
The document discusses correlation analysis and different types of correlation:
1. It defines correlation as a measure of the relationship between two variables, noting that correlation examines how two variables co-vary rather than establishing that one causes the other.
2. Positive correlation occurs when both variables increase or decrease together, while negative correlation is when one variable increases as the other decreases.
3. Zero correlation means no relationship between the variables as change in one does not correspond to change in the other.
This document discusses correlation and linear regression analysis. It begins by outlining the learning objectives: describe relationships between variables using correlation, estimate the effects of independent variables on dependent variables with regression, and perform and interpret different types of regression analyses. It then gives examples of how correlation quantifies the strength and direction of relationships between interval variables, and how regression finds the best-fitting linear equation to estimate relationships between variables. It emphasizes that regression minimizes the sum of squared errors to find the line of best fit.
1) The document provides an overview of key concepts in probability and statistics, including random variables, probability distributions, and characteristics of distributions such as expected value and variance.
2) It defines key probability terms such as population, sample, mutually exclusive events, independent events, and exhaustive events. It also covers how to calculate the probability of single and multiple events.
3) The document distinguishes between discrete and continuous random variables and probability distributions. It explains how probability distributions associate probabilities with individual outcomes for discrete variables but use probability density functions to provide probabilities over intervals for continuous variables.
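For the discrete case, expected value and variance follow directly from the definitions E[X] = Σ x·p(x) and Var[X] = E[X²] − E[X]². A sketch on a made-up distribution (a loaded die, chosen purely for illustration):

```python
# Expected value and variance of a discrete probability distribution.
# The distribution itself is an illustrative assumption.
dist = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

mean = sum(x * p for x, p in dist.items())
var = sum(x * x * p for x, p in dist.items()) - mean ** 2
print(round(mean, 2), round(var, 2))
```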
This document discusses correlation coefficient and different types of correlation. It defines correlation coefficient as the measure of the degree of relationship between two variables. It explains different types of correlation such as perfect positive correlation, perfect negative correlation, moderately positive correlation, moderately negative correlation, and no correlation. It also discusses different methods to study correlation including scatter diagram method, graphic method, Karl Pearson's coefficient of correlation method, and Spearman's rank correlation method. It provides examples and steps to calculate correlation coefficient using these different methods.
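Spearman's rank method, one of the methods named above, ranks each variable and applies ρ = 1 − 6·Σd² / (n(n² − 1)) to the rank differences (the no-ties form). A sketch on made-up scores:

```python
# Spearman's rank correlation (no-ties formula). Scores are illustrative.
xs = [86, 71, 77, 68, 91]
ys = [88, 65, 80, 70, 96]

def ranks(values):
    """Rank of each value within its list, 1 = smallest (no ties assumed)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(xs), ranks(ys)
n = len(xs)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(rho)  # close to 1: the two rankings largely agree
```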
This document discusses correlation and linear regression. It defines correlation as the association between two variables, which can be positive, negative, or non-existent. Linear correlation exists when plotted points approximate a straight line. The correlation coefficient r measures the strength of a linear relationship between -1 and 1. Linear regression finds the linear relationship that best fits the data using a regression equation to predict y values from x. Multiple linear regression extends this to use multiple explanatory variables.
1. The document discusses the concept of correlation, including the different types and methods of measuring correlation.
2. It provides background on the history of correlation, beginning with its proposal by the French scientist A. Bravais and the development of its graphical representation by Sir Francis Galton.
3. Key aspects covered include Karl Pearson's coefficient of correlation (r), which measures the strength and direction of linear relationships between variables ranging from -1 to 1. Examples are provided to illustrate different degrees of positive and negative correlation.
The document discusses rules for simplifying radical expressions. It states that when the radicals in a quotient or product have the same index, a quotient of radicals can be rewritten as a single radical of the quotient, and a product of radicals as a single radical of the product. Examples demonstrate simplifying expressions using these rules together with rules for rational exponents and for reducing the index of fractional exponents.
1. The document discusses logarithmic transformations, which can be used to transform non-linear data into a linear format to better model exponential relationships.
2. There are two options for logarithmic transformations: taking the log of just the response variable y, or taking the log of both the explanatory variable x and the response variable y.
3. Graphing the transformed data allows one to determine which option produces a more linear relationship and thus the better transformation to use.
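The comparison in step 3 can also be made numerically: fit both transformed versions and see which is more linear. The sketch below uses data that is exactly exponential by construction (y = 2·1.5ˣ, an illustrative assumption), so log(y) vs x should win:

```python
import math

# Option 1: log(y) vs x.  Option 2: log(y) vs log(x).
# Data generated from an exponential relationship for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * 1.5 ** x for x in xs]

def pearson_r(us, vs):
    n = len(us)
    ubar, vbar = sum(us) / n, sum(vs) / n
    suv = sum((u - ubar) * (v - vbar) for u, v in zip(us, vs))
    suu = sum((u - ubar) ** 2 for u in us)
    svv = sum((v - vbar) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)

log_y = [math.log(y) for y in ys]
r_semilog = pearson_r(xs, log_y)                        # option 1
r_loglog = pearson_r([math.log(x) for x in xs], log_y)  # option 2
print(round(r_semilog, 4), round(r_loglog, 4))
```

The higher correlation after transformation indicates the better option, mirroring the graphical check the document describes.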
This document discusses correlation and linear regression. It defines correlation as a measure of the linear association between two variables. The strength of the correlation is quantified from 0 (no association) to 1 (perfect association). Regression analysis predicts the value of a dependent variable based on independent variables. Simple linear regression fits a linear equation to the data of the form Y=β0 + β1X + ε, where β0 is the Y-intercept and β1 is the slope of the regression line. The coefficient of determination, R-squared, indicates how much of the variation in the dependent variable is explained by the independent variable.
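The coefficient of determination mentioned above is R² = 1 − SSE/SST, the share of the variation in the dependent variable explained by the fitted line. A self-contained sketch with made-up data:

```python
# Fit y = b0 + b1*x by least squares, then compute R^2 = 1 - SSE/SST.
# Data is illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
sst = sum((y - ybar) ** 2 for y in ys)                        # total SS
r_squared = 1 - sse / sst
print(round(r_squared, 4))  # near 1: x explains most of the variation in y
```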
The document discusses the parties, settings, purposes, and reference sources for psychological testing and assessment. It identifies the main parties as test developers/publishers, test users, test takers, and society. Assessments are commonly conducted in educational, geriatric, counseling, clinical, business, military, and other settings to evaluate individuals' abilities, diagnose issues, inform treatment, and make organizational decisions. Authoritative information on specific tests can be found in test manuals, reference books, journal articles, online databases, and test publishers' catalogs.
The document discusses psychological research methods. It begins by defining research and its goals, which include describing behavior, establishing relationships between causes and effects, and developing theories about human behavior. It then describes the empirical research cycle and different research methods, both primary like experiments and secondary like meta-analyses. It discusses variables, research designs, qualitative and quantitative data collection and analysis, and drawing conclusions. Finally, it covers ethical issues in research and challenges in determining causality.
This document provides an introduction to psychological testing and assessment. It defines key terms like tests, testing, and assessment. It describes the different types of constructs that can be measured by psychological tests, like traits, conditions, intelligence, and attitudes. It also discusses the differences between testing and assessment and contrasts the two processes. The document outlines the various tools used in psychological assessment, including tests, interviews, observations, and more. It discusses important topics like test administration and interpretation, the parties involved in testing, and the various settings where assessment is conducted. Finally, it covers issues like culture, ethics, and laws related to psychological assessment.
This module introduces key concepts in statistics. It will cover defining statistics and related terms, the history and importance of statistics, summation rules, sampling techniques, organizing data in tables, constructing frequency distributions, and measures of central tendency for ungrouped data. The goal is for students to understand how statistics is used in daily life and to learn techniques for collecting, organizing, and analyzing data.
Psychological testing is a field characterized by the use of samples of performance to assess psychological constructs, such as cognitive and emotional functioning, of a given individual.
Psychological tests are formal tools used to measure mental functioning and behaviors. They can be administered in various settings like schools, hospitals, and workplaces to assess abilities, personality, and neurological status. Common uses of tests include education placement, career counseling, diagnosing disorders, and selecting job applicants. Tests vary in their administration method, targeted behaviors, and purpose between ability, personality, and clinical domains. Proper tests are standardized, objective, use norms, and are reliable and valid measures of their intended construct.
Correlation is a statistical technique used to determine the degree of relationship between two variables. Correlational research aims to identify and describe relationships but does not imply causation. Positive correlation indicates high scores on one variable are associated with high scores on the other, while negative correlation means high scores on one variable are associated with low scores on the other. Correlational research can be used for explanatory or predictive purposes. More complex techniques like multiple regression allow prediction using combinations of variables. Threats to internal validity like subject characteristics must be controlled.
This document discusses correlation analysis and its various types. Correlation is the degree of relationship between two or more variables. There are three stages to solve correlation problems: determining the relationship, measuring significance, and establishing causation. Correlation can be positive, negative, simple, partial, or multiple depending on the direction and number of variables. It is used to understand relationships, reduce uncertainty in predictions, and present average relationships. Conditions like probable error and coefficient of determination help interpret correlation values.
This document discusses psychological assessment and tests. It describes the development and types of psychological tests, including intelligence tests like the Stanford-Binet and Wechsler scales, achievement tests, aptitude tests, personality tests like the MMPI, and projective tests like the Rorschach inkblot test. It also outlines the nurse's role in psychological assessment, which includes educating patients, observing behaviors, and documenting changes.
Psychological tests were developed to assist in understanding human behavior and making important decisions in an objective manner. Tests provide standardized samples of behavior that can be used to infer underlying traits and make comparisons to norms. This allows for decisions to be made with less bias than relying solely on subjective human judgment. Tests quantify results to precisely describe behaviors and allow for clearer communication than qualitative descriptions alone.
This document discusses rank correlation and Spearman's rank correlation coefficient. It defines correlation as a relationship between two variables where a change in one variable corresponds to a change in the other. Rank correlation involves ranking observations from highest to lowest rather than using the original values, which avoids assumptions about the population distribution. Spearman's rank correlation coefficient measures the correspondence between two rankings and is calculated based on the differences between ranks of paired items. It provides a distribution-free measure of correlation.
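As an illustrative sketch (not code from the summarized document), Spearman's coefficient can be computed from the rank differences as ρ = 1 − 6Σd² / (n(n² − 1)); the helper below assumes no tied values:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation from rank differences (assumes no ties)."""
    def ranks(values):
        # Rank 1 = largest value, matching "highest to lowest" ranking.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))
```

Perfectly agreeing rankings give ρ = 1 and perfectly reversed rankings give ρ = −1, which is what makes it a distribution-free measure of correspondence.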
Correlation of subjects in school (B.Ed notes), by Namrata Saxena
This document discusses the concept of correlation in education. It defines correlation as the mutual relationship between different subjects or variables in a curriculum. The document outlines the importance of correlation, including that it helps students perceive knowledge as a whole, strengthens retention of knowledge, and promotes well-rounded development. It discusses different types of correlation, including vertical/internal correlation between topics within a subject and horizontal/external correlation between different subjects. Examples are provided of how mathematics can be correlated with other subjects like science, geography, and economics.
Psychological test meaning, concept, need & importance, by jd singh
This document discusses psychological testing. It defines psychological testing as a standardized measure of a person's behavior that is used to observe differences among individuals. It notes that tests measure constructs like abilities, functioning, and personality. The document outlines the objectives, need, importance and types of psychological tests. It describes the major characteristics of tests including standardization, norms, reliability and validity. Finally, it provides examples of commonly used psychological tests.
The document discusses correlation analysis and different types of correlation. It defines correlation as the linear association between two random variables. There are three main types of correlation:
1) Positive vs negative vs no correlation based on the relationship between two variables as one increases or decreases.
2) Linear vs non-linear correlation based on the shape of the relationship when plotted on a graph.
3) Simple vs multiple vs partial correlation based on the number of variables.
The document also discusses methods for studying correlation including scatter plots, Karl Pearson's coefficient of correlation r, and Spearman's rank correlation coefficient. It provides interpretations of the correlation coefficient r and the coefficient of determination r².
Correlation describes the relationship between two or more variables. A positive correlation means that as one variable increases, the other also increases, while a negative correlation means that as one variable increases, the other decreases. Correlation is measured numerically using coefficients like the Pearson correlation coefficient r, which ranges from -1 to 1, with values farther from 0 indicating stronger linear relationships and the direction indicating positive or negative correlation. Correlation is used in business and economics to study relationships between variables like price and demand.
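The Pearson coefficient described above follows directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²); the following is an illustrative Python sketch, not code from any of the summarized documents:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient; ranges from -1 to 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Values near ±1 indicate a strong linear relationship; the sign gives the direction.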
This document provides an introduction to correlation and regression analysis. It defines correlation as a measure of the association between two variables and regression as using one variable to predict another. The key aspects covered are:
- Calculating correlation using Pearson's correlation coefficient r to measure the strength and direction of association between variables.
- Performing simple linear regression to find the "line of best fit" to predict a dependent variable from an independent variable.
- Using a TI-83 calculator to graphically display scatter plots of data and calculate the regression equation and correlation coefficient.
This document summarizes Steve Hoang's presentation on linear models in R. The presentation introduced linear models, demonstrated how to fit and interpret them in R, and discussed transforming data and hypothesis testing. It also briefly covered mixed effects models. The goal was to provide basic tools for interpreting linear models and pursuing more advanced topics.
This document provides information about assumptions of linear regression and checking those assumptions. It discusses the key assumptions of linear regression as normal distribution of residuals, linear relationship between variables, equal variance of residuals, and independence of observations. It outlines steps to check the assumptions, including plotting residuals against fitted and explanatory values and a normal probability plot of residuals. Specific things to look for that may indicate violations of assumptions are also described.
Data pre-processing for neural networks
NNs learn faster and give better performance if the input variables are pre-processed before being used to train the network
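One common form of this pre-processing is z-score standardization of each input variable; the sketch below is illustrative only (the summarized slides may describe other methods, such as min-max scaling):

```python
import statistics

def standardize(columns):
    """Z-score each input variable: subtract its mean, divide by its std dev."""
    out = []
    for col in columns:
        mu = statistics.mean(col)
        sigma = statistics.pstdev(col)
        out.append([(v - mu) / sigma for v in col])
    return out
```

After standardization each input has mean 0 and unit variance, which keeps all inputs on a comparable scale during training.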
Mathematics in the Modern World (Patterns and Sequences).pptx, by ReignAnntonetteYayai
Patterns exist in a variety of forms. The petals of a flower and the arrangement of leaves reveal sequential patterns. Nature abounds with different colors and shapes: the rainbow mosaic of a butterfly's wings, the undulating ripples of a desert dune. But these miraculous creations not only delight the imagination; they also challenge our understanding. How do these patterns develop? What sorts of rules and guidelines shape the patterns in the world around us? Some patterns are molded with a strict regularity. At least superficially, the origin of regular patterns often seems easy to explain. Thousands of times over, the cells of a honeycomb repeat their hexagonal symmetry. The honeybee is a skilled and tireless artisan with an innate ability to measure the width and gauge the thickness of the honeycomb it builds. Although the workings of an insect's mind may baffle biologists, the regularity of the honeycomb attests to the honeybee's remarkable architectural abilities. A pattern is something which helps us anticipate what we might see or expect to happen next. It may also help us know what may have come before or what we are seeing currently. There are four types of patterns: (1) logic patterns, (2) number patterns, (3) geometric patterns, and (4) word patterns. A pattern may have symmetry: an isometry of the plane that preserves the pattern, that is, a way of transforming the plane that preserves geometrical properties such as length. There are four types of isometries of the Euclidean plane: (1) translation, (2) reflection, (3) rotation, and (4) dilation. Moreover, combinations of reflections, translations, and rotations form further isometries, called rigid transformations, which leave the dimensions of the object and its image unchanged.
A propositional variable, represented by a lowercase or capital letter of the English alphabet, denotes an arbitrary proposition with an unspecified truth value. An assertion which contains at least one propositional variable is called a propositional form. Consider the proposition: "If I study the lesson, then I will pass the test." In a conditional (hypothetical) proposition, the truth does not rest on the truth of each statement taken singly. Rather, it depends on the valid sequence between the members of the proposition. In the example given, we do not assert "I study the lesson," nor do we assert "I will pass the test." We simply declare that the statement "I will pass the test" is dependent on the statement "I study the lesson," and vice versa. Note: The word "then" as part of the consequent may be omitted: "If I study the lesson, I will pass the test." The consequent may also be written ahead of the antecedent, with the word "then" omitted: "I will pass the test, if I study the lesson."
This document discusses multiple linear regression analysis. It begins by defining a multiple regression equation that describes the relationship between a response variable and two or more explanatory variables. It notes that multiple regression allows prediction of a response using more than one predictor variable. The document outlines key elements of multiple regression including visualization of relationships, statistical significance testing, and evaluating model fit. It provides examples of interpreting multiple regression output and using the technique to predict outcomes.
The document discusses linear inequalities in one variable. It defines a linear inequality in one variable as an inequality that can be written in the form ax + b > c, where a, b, and c are real numbers. It notes that the > symbol can be replaced by ≥, <, or ≤. The document provides examples and steps for transforming linear inequalities into standard form where the leading coefficient a is positive and the inequality is written as ax + b > 0. It emphasizes using properties of inequalities and multiplying by -1 when a is negative.
The "Great Lakes" data set is an example of a non-seasonal, non-stationary time series that exhibits a slight upward linear trend. The series is differenced and Box-Cox transformed in order to stabilize the mean and variance, correcting for non-stationarity. The best model fitted to the data was an ARIMA(4,1,0), identified by inspecting the autocorrelation and partial autocorrelation functions; the best coefficient estimates were selected via the AIC. The residuals of the fitted model were verified as independent using the McLeod-Li and Ljung-Box tests and checked for normality with the Shapiro-Wilk test. The model proved to be an adequate representation of the data, providing reasonable predictions for precipitation.
This document discusses the normal distribution and its key properties. It also discusses sampling distributions and the central limit theorem. Some key points:
- The normal distribution is bell-shaped and symmetric. It is defined by its mean and standard deviation. Approximately 68% of values fall within 1 standard deviation of the mean.
- Sample statistics like the sample mean follow sampling distributions. When samples are large and random, the sampling distributions are often normally distributed according to the central limit theorem.
- Correlation and regression analyze the relationship between two variables. Correlation measures the strength and direction of association, while regression finds the best-fitting linear relationship to predict one variable from the other.
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach..., by Maninda Edirisooriya
Linear regression is the simplest machine learning algorithm and one of the most fundamental statistical learning techniques. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the second half of 2023.
This document discusses various methods for transforming variables to achieve normality, stabilize variance, and ensure linearity. It describes common transformations like logarithmic, square root, power, inverse, reciprocal, and cube root. The objectives of transformation are outlined. Guidelines are provided for choosing a transformation method and assessing whether it successfully achieves linearity. Precautions for using transformations and ensuring assumptions still hold are also noted.
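The common transformations listed can be sketched as a small dispatch table; this is illustrative Python, not code from the summarized document, and it assumes strictly positive data where the transformation requires it:

```python
import math

def transform(values, method):
    """Apply a common variance-stabilizing transformation to each value.

    Log, square root, and inverse require positive data; this sketch does
    not validate its inputs.
    """
    funcs = {
        "log": math.log,
        "sqrt": math.sqrt,
        "inverse": lambda v: 1 / v,
        "cube_root": lambda v: v ** (1 / 3),
    }
    return [funcs[method](v) for v in values]
```

As the document cautions, after transforming one should re-check the regression assumptions (normality, constant variance, linearity) on the transformed scale.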
This document provides guidelines for carrying out statistical analyses in SPSS and R using various datasets. It discusses how to replicate analyses from 2x2 tables using individual level data, and how to perform tests such as the Kappa test, McNemar's test, chi-square tests, tests for independent proportions, Fisher's exact test, Levene's test, Wilcoxon signed-rank tests, Mann-Whitney U tests, t-tests, and Q-Q plots in both SPSS and R. Instructions are provided for reading in SPSS data files into R and accessing variable values.
Avionics 738 Adaptive Filtering at Air University PAC Campus by Dr. Bilal A. Siddiqui in Spring 2018. This lecture covers background material for the course.
This document provides an introduction and overview of linear equations. It defines key terms like equations, variables, and solutions. It explains that the goal in solving equations is to find the value of the unknown that makes the statement true. The document outlines various properties of equality that can be used to solve equations, such as applying the same operation to both sides. It also distinguishes between linear and nonlinear equations. Several examples are provided to demonstrate how to solve different types of linear equations, including those with fractions and those that simplify to linear form. The document also briefly introduces solving power equations, which involve variables raised to powers, as well as equations with fractional exponents.
Probability theory provides a framework for quantifying and manipulating uncertainty. It allows optimal predictions given incomplete information. The document outlines key probability concepts like sample spaces, events, axioms of probability, joint/conditional probabilities, and Bayes' rule. It also covers important probability distributions like binomial, Gaussian, and multivariate Gaussian. Finally, it discusses optimization concepts for machine learning like functions, derivatives, and using derivatives to find optima like maxima and minima.
The document discusses various sorting algorithms in Java including bubble sort, insertion sort, selection sort, merge sort, heapsort, and quicksort. It provides explanations of how each algorithm works and comparisons of the time performance of each algorithm based on testing multiple runs. Quicksort and heapsort generally had the best performance while bubble sort consistently had the worst performance.
Slides for "Do Deep Generative Models Know What They Don't Know?", by Julius Hietala
My slides that discuss different deep generative models, mainly normalizing flows for density estimation at a deep learning seminar at Aalto University fall 2019.
The document discusses bivariate and multivariate linear regression analysis, explaining how to estimate regression coefficients using software like SPSS and interpret their results. It covers topics such as estimating and interpreting intercept and slope coefficients, measuring predictive power using R-squared, and testing the significance of individual regression coefficients and the overall regression model through techniques like t-tests and F-tests.
Changing the subject of a formula (Simple Formulae), by Alona Hall
This document provides instructions for changing the subject of a formula. It begins by explaining the concept of changing the subject, which means taking a formula with one subject and making a different term the subject. It then demonstrates the procedure through examples of changing the subject in formulas involving addition, subtraction, multiplication, division and combinations of operations. The key steps are to identify what mathematical operation was applied to the original subject and then apply the inverse operation to isolate the new subject. A shorter method is also introduced which involves flipping the sides of the formula and applying the inverse operation to the other side. Worked examples demonstrating both methods are provided.
The document provides an overview of the structure and content covered on the AP Calculus AB exam, including:
- The exam is 3 hours 15 minutes long and divided into multiple choice and free response sections testing limits, derivatives, integrals, and applications of calculus.
- Content topics covered include limits of functions, continuity, derivatives and their applications (related rates, max/min problems), integrals, and differential equations.
- Formulas and strategies are provided for evaluating limits, finding derivatives using various rules, applying derivatives to sketch curves, solve optimization problems, and solve motion problems using related rates.
1. The document discusses probability and chance experiments. It provides examples to illustrate key concepts such as sample space, events, and how to calculate probabilities.
2. One example examines student food preferences in a cafeteria, with the sample space consisting of all possible combinations of student gender and food line choice.
3. The document also covers conditional probability, explaining how to calculate the probability of an event given that another event has occurred. An example calculates the probability of nausea given being seated in the front of a bus.
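The conditional-probability calculation described in point 3 can be sketched with counts; the numbers below are made up for illustration and are not taken from the bus example:

```python
# Hypothetical counts (illustrative only): outcomes as (seat, nausea) pairs.
outcomes = {("front", True): 15, ("front", False): 35,
            ("back", True): 5, ("back", False): 45}

total = sum(outcomes.values())
p_front = sum(v for (seat, _), v in outcomes.items() if seat == "front") / total
p_front_and_nausea = outcomes[("front", True)] / total

# P(nausea | front) = P(nausea and front) / P(front)
p_nausea_given_front = p_front_and_nausea / p_front
```

With these made-up counts, 15 of the 50 front-seat riders report nausea, so the conditional probability is 15/50 = 0.3.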
This document discusses key concepts for collecting data and conducting research studies. It defines variables, data sets, and types of bias that can occur in data collection. Common sampling methods like simple random sampling, stratified sampling, and cluster sampling are described. The document also distinguishes between observational studies and experiments, noting that experiments allow researchers to control variables and determine causal effects. Key aspects of experimental design like treatments, placebos, and control groups are also explained.
This document provides examples and explanations of various graphical methods for describing data, including frequency distributions, bar charts, pie charts, stem-and-leaf diagrams, histograms, and cumulative relative frequency plots. It demonstrates how to construct these graphs using sample data on student weights, grades, ages, and other examples. The goal is to help readers understand different ways to visually represent data distributions and patterns.
This document discusses sampling distributions and the central limit theorem. It defines key terms like population, statistic, and sampling distribution. It shows examples of how sampling distributions become more normal and less variable as the sample size increases. The central limit theorem states that for large sample sizes, the sampling distribution of the sample mean will be approximately normally distributed even if the population is not. It provides properties and rules for the sampling distributions of the sample mean and sample proportion.
This document discusses the importance of statistics and introduces key concepts. It explains that statistics involves collecting, analyzing, and drawing conclusions from data. It also defines important statistical terms like population, sample, variable, and different types of data. Frequency distributions are introduced as a way to organize categorical data by displaying the categories and associated frequencies or relative frequencies. An example frequency distribution is provided using vision correction data from a classroom example.
This document discusses various numerical methods for describing data, including measures of central tendency (mean, median), variability (range, variance, standard deviation), and graphical representations (boxplots). It provides examples and formulas for calculating the mean, median, quartiles, interquartile range, variance, standard deviation, and constructing boxplots. Outliers are defined as observations more than 1.5 times the interquartile range from the quartiles.
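The outlier rule described above (observations more than 1.5 × IQR beyond the quartiles) can be sketched as follows; this illustrative Python uses the median-split convention for quartiles, which may differ slightly from the document's convention:

```python
import statistics

def five_number_and_outliers(data):
    """Quartiles, IQR, and outliers flagged by the 1.5 * IQR fence rule."""
    s = sorted(data)
    n = len(s)
    med = statistics.median(s)
    lower_half = s[: n // 2]          # values below the median position
    upper_half = s[(n + 1) // 2:]     # values above the median position
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in s if v < low_fence or v > high_fence]
    return q1, med, q3, iqr, outliers
```

For [1, 2, ..., 9, 100] the quartiles are 3 and 8, the IQR is 5, and 100 falls outside the upper fence of 15.5, so it is flagged as an outlier.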
This document provides an overview of random variables and probability distributions. It defines discrete and continuous random variables and gives examples of each. Discrete random variables have probabilities associated with each possible value, while continuous random variables are defined by probability density functions where the area under the curve equals the probability. The document discusses how to calculate the mean, variance and standard deviation of discrete random variables from their probability distributions. It also covers how the mean and variance are affected for linear transformations of random variables.
This document summarizes bivariate data and linear regression analysis. It introduces scatterplots and the Pearson correlation coefficient as ways to examine relationships between two variables. A positive correlation indicates that as one variable increases, so does the other, while a negative correlation means one variable increases as the other decreases. The least squares line provides the best fit linear relationship between two variables by minimizing the sum of squared residuals. Calculating the slope and y-intercept of this line allows predicting y-values from x-values. Examples using bus fare and distance data demonstrate these concepts.
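The least squares slope and intercept mentioned above follow the standard formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b x̄; the following is an illustrative sketch, not code from the summarized slides:

```python
def least_squares_line(x, y):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept
```

Once fitted, a y-value is predicted from an x-value as ŷ = slope · x + intercept.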
This document provides information on estimating population characteristics from sample data, including:
- Point estimates are single numbers based on sample data that represent plausible values of population characteristics.
- Confidence intervals provide a range of plausible values for population characteristics with a specified degree of confidence.
- Formulas are given for constructing confidence intervals for population proportions and means using large sample approximations or t-distributions.
- Guidelines for determining necessary sample sizes to estimate population values within a specified margin of error are also outlined.
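The large-sample confidence interval for a population proportion described above takes the form p̂ ± z√(p̂(1 − p̂)/n); the sketch below is illustrative, using z = 1.96 for 95% confidence:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - margin, p_hat + margin
```

For 50 successes out of 100 trials this gives roughly (0.402, 0.598), i.e. 0.5 ± 0.098.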
The document describes multiple regression models and their applications. It begins by defining a general multiple regression model that relates a dependent variable to multiple predictor variables. It then discusses key aspects of multiple regression models like regression coefficients, the regression function, polynomial regression models, and qualitative predictor variables. The document provides examples of applying multiple regression to model lung capacity based on variables like height, age, gender, and activity level. It describes building different regression models and evaluating their fit and significance.
1. A study examined survival times of patients with advanced cancers in different organs (stomach, bronchus, colon, ovary, or breast) treated with ascorbate.
2. An analysis of variance (ANOVA) was used to determine if survival times differed based on the affected organ. ANOVA compares the means of multiple groups and tests if they are equal.
3. The ANOVA test statistic, F, compares the variation between groups (mean square for treatments) to the variation within groups (mean square for error). If F exceeds a critical value, then at least one group mean is significantly different from the others.
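The F statistic described in point 3 can be computed directly from the between-group and within-group mean squares; this is an illustrative Python sketch, not code or data from the study:

```python
def anova_f(groups):
    """One-way ANOVA F statistic: MS(treatments) / MS(error)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation (treatments) and within-group variation (error).
    ss_treat = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    return (ss_treat / (k - 1)) / (ss_error / (n - k))
```

A large F means the group means spread out far more than the within-group noise would explain, which is the evidence against equal means.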
This document discusses methods for comparing two population or treatment means, including notation, hypothesis tests, and confidence intervals. Key points covered include:
1) Notation for comparing two means includes the sample size, mean, variance, and standard deviation for each population or treatment.
2) Hypothesis tests for comparing two means can use a z-test if the population standard deviations are known, or a two-sample t-test if the standard deviations are unknown.
3) Confidence intervals can be constructed for the difference between two population means using a t-distribution, assuming independent random samples of sufficient size or approximately normal populations.
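The two-sample t statistic for unknown standard deviations (point 2) can be sketched in its unpooled (Welch) form; illustrative Python, not code from the summarized document:

```python
import math

def two_sample_t(x1, x2):
    """Two-sample t statistic (unpooled form) for comparing two means."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    # Sample variances (n - 1 in the denominator).
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
```

Identical samples give t = 0; a sample with the larger mean listed first gives a positive t.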
1. The document discusses categorical data analysis and goodness-of-fit tests. It introduces concepts such as univariate categorical data, expected counts, the chi-square test statistic, and assumptions of the chi-square test.
2. An example analyzes faculty status data from a university using a goodness-of-fit test to determine if the proportions are equal across categories. The test fails to reject the null hypothesis that the proportions are equal.
3. Tests for homogeneity and independence in two-way tables are described. Examples calculate expected counts and perform chi-square tests to compare populations' category proportions.
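The chi-square statistic mentioned in point 1 is Σ(O − E)² / E summed over the categories; a minimal illustrative sketch:

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E per category."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For observed counts [10, 20, 30] against equal expected counts [20, 20, 20], the statistic is 5 + 0 + 5 = 10, which would then be compared to a chi-square critical value with 2 degrees of freedom.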
1. The document discusses hypothesis testing using a single sample. It outlines the formal structure of hypothesis tests including the null and alternative hypotheses.
2. Common hypothesis tests are presented including tests of a population proportion, mean, and variability. Examples of hypotheses and solutions are provided.
3. The key steps in a hypothesis testing analysis are defined including stating the hypotheses, selecting the significance level, computing the test statistic and p-value, and making a conclusion. Large sample and small sample tests are described.
This document provides an overview of simple linear regression and correlation. It defines key concepts such as the population regression line, the simple linear regression model equation, and assumptions of the model. Examples are provided to demonstrate calculating the least squares regression line, interpreting the slope and intercept, and evaluating goodness of fit using r-squared. Formulas are given for computing sums of squares, estimating the standard deviation of residuals, and constructing confidence intervals for the slope of the population regression line.
This document provides instructions for adding grades to a Google Site by hosting the grades on Dropbox. It explains how to sign up for a Dropbox account, export grades from EasyGrade Pro and save them to the Dropbox public folder, and embed the grades on a new "Grades" page on the Google Site using an iframe. The process is then tested and instructions are provided for uploading updated grades by exporting new reports from EasyGrade Pro and allowing Dropbox to automatically sync the changes.
This document provides an overview of how to perform chi-square tests for goodness of fit and tests of homogeneity using categorical data. It explains how to set up and carry out chi-square tests through defining hypotheses, calculating test statistics, determining p-values, and making decisions. Examples are provided for chi-square goodness of fit tests to determine if observed count data fits an expected distribution, as well as chi-square tests of homogeneity to assess if the distribution of one categorical variable is the same across categories of another variable. Calculator instructions are also given for performing the relevant calculations and statistical analyses on a TI-83/84 graphing calculator.
The document discusses regression analysis and constructing confidence intervals and conducting significance tests for the slope (β) of the regression line. It provides guidance on checking the assumptions of the regression model, outlines the steps for constructing a confidence interval for β which involves calculating the standard error of the slope (SEb) and the appropriate t-statistic. It also outlines the steps for a significance test on β, which involves defining the null and alternative hypotheses, checking assumptions, and determining whether to reject or fail to reject the null based on the calculated p-value. An example problem is presented to demonstrate applying these procedures.
The document discusses comparing two population parameters using sample data. It covers comparing two means using a two-sample t-test or z-test, and comparing two proportions using a two-sample z-test. Key assumptions for these tests include independent random samples from each population and sample sizes large enough for the sampling distributions to be approximately normal. An example compares systolic blood pressure for two groups, one taking calcium and one placebo, and finds no significant difference. A second example finds preschool significantly reduces the proportion needing later social services.
This document discusses significance tests for population means and proportions using Student's t-distribution and the normal distribution. It provides examples of hypothesis testing for a population mean using a paired t-test and for a population proportion using a single-sample z-test. It also discusses the assumptions, test statistics, and interpretations for these tests. Confidence intervals are presented as complementary to significance tests for estimating population parameters.
How to Manage Your Lost Opportunities in Odoo 17 CRM, by Celine George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM.
How to Make a Field Mandatory in Odoo 17, by Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
This slide deck is intended for master's students (MIBS & MIFB) at UUM. It is also useful for readers interested in the topic of contemporary Islamic banking.
How to Setup Warehouse & Location in Odoo 17 Inventory, by Celine George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
3. What if the scatterplot is not linear?
• Of course, not all data is linear!
• Our method in statistics will involve mathematically operating on one or both of the explanatory and response variables
• An inverse transformation will then be used to recover a non-linear regression model for the raw data
• This will be a little “mathy”
4. Transformations
• Before we begin transformations, remember that some well-known phenomena act in predictable ways
– E.g., when working with time and gravity, you should know that there is a square relationship between distance and time!
5. The Basics
• The data from measurements (raw data) must be operated on.
• Apply the same mathematical transformation to all of the raw data
– Ex. “Square every response”
• Use methods from the previous chapter to find the LSRL for the transformed data
• Analyze your regression to ensure the LSRL is appropriate
• Apply an inverse transformation to the LSRL to find the regression model for the raw data.
6. Example
Please refer to p. 265, exercise 4.2

Length (cm)   Period (s)
       16.5        0.777
       17.5        0.839
       19.5        0.912
       22.5        0.878
       28.5        1.004
       31.5        1.087
       34.5        1.129
       37.5        1.111
       43.5        1.290
       46.5        1.371
      106.5        2.115
9. Example
• L3 = L2^0.5 (square root)
• LinReg L1, L3
• Note that the value of r² has increased
• Note that the value of the residual of the last point has decreased
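The calculator steps above can be sketched in Python as well; this is an added illustration, not part of the course materials. Reading the lists as L1 = period and L2 = length is an assumption on my part; with that reading, the transform takes the square root of the lengths, consistent with the square relationship between distance and time. Since correlation is symmetric in x and y, the code fits period against √length and compares r² before and after the transformation.

```python
# Square-root transformation of the pendulum data from the Example
# slide, then an LSRL fit using the formulas from the previous
# chapter. Standard library only.
import math

length = [16.5, 17.5, 19.5, 22.5, 28.5, 31.5, 34.5, 37.5, 43.5, 46.5, 106.5]
period = [0.777, 0.839, 0.912, 0.878, 1.004, 1.087, 1.129, 1.111, 1.290, 1.371, 2.115]

def lsrl(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    b = sxy / sxx
    a = ybar - b * xbar
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

# Fit on the raw data, then on the transformed data sqrt(length)
_, _, r_raw = lsrl(length, period)
_, _, r_t = lsrl([math.sqrt(L) for L in length], period)

print(f"raw:         r^2 = {r_raw**2:.4f}")
print(f"transformed: r^2 = {r_t**2:.4f}")   # higher after the transform
```

The improvement in r² here is modest, which is why the slide also checks the residual of the last (high-leverage) point.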
10. Exponential Models
• Many natural phenomena are explained by an exponential model.
• Exponential models are marked by sharp increases in growth and decay.
• Basic model: y = A·B^x
• For this transformation, you need to take the logarithm of the response data.
• You may use “log10” or “ln”; your choice.
– I prefer “ln” (of course)
11. Exponential Models
After the transformation, we have the following linear model: ln(y) = a + b·x
1. ln(y) = a + b·x
2. e^ln(y) = e^(a + b·x)        exponentiate both sides
3. y = e^a · e^(b·x)            property of exponents
4. Let ‘A’ = e^a and ‘B’ = e^b  redefine variables
5. y = A·B^x                    this is our model
12. Exponential Models
• Since this is an ‘applied math’ course, you need not remember how to derive the inverse transformation
• Whew
• BUT you do need to memorize:
when ln(y) = a + b·x,
y = A·B^x
where ‘A’ = e^a and ‘B’ = e^b
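A short Python sketch of this recipe, using made-up data generated purely for illustration (roughly y = 3·2^x): take ln of the response, fit the LSRL, then back-transform with A = e^a and B = e^b.

```python
# Fitting the exponential model y = A*B^x via the ln(y)
# transformation. The data below is invented for illustration.
import math

x = [0, 1, 2, 3, 4, 5]
y = [3.1, 6.0, 12.4, 23.9, 48.5, 95.8]   # roughly y = 3 * 2^x

n = len(x)
ln_y = [math.log(v) for v in y]

# LSRL of ln(y) on x: ln(y) = a + b*x
xbar, ybar = sum(x) / n, sum(ln_y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, ln_y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

# Inverse transformation (the fact to memorize):
# when ln(y) = a + b*x, the raw-data model is y = A*B^x
A = math.exp(a)
B = math.exp(b)
print(f"y = {A:.2f} * {B:.2f}^x")   # close to y = 3 * 2^x
```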
27. Power Models
• These models are used when the rate of increase is less severe than an exponential model, or if you suspect a ‘root’ model
• For this model, you will take the logarithms of both the explanatory variable and the response variable
28. Power Models
LSRL on the transformed data yields: ln(y) = a + b·ln(x)
1. ln(y) = a + b·ln(x)
2. e^ln(y) = e^(a + b·ln(x))
3. y = e^a · e^ln(x^b)          since b·ln(x) = ln(x^b)
4. y = e^a · x^b
5. Let ‘A’ = e^a
6. y = A·x^b
46. Power Models
• Much like the exponential model, you only need to know how the transformed model becomes the model for the raw data.
• When ln(y) = a + b·ln(x),
y = A·x^b
where ‘A’ = e^a
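The same recipe works for the power model, again sketched with made-up illustrative data (roughly y = 2·x^0.5): take ln of both variables, fit the LSRL, then back-transform with A = e^a.

```python
# Fitting the power model y = A*x^b by taking ln of both the
# explanatory and response variables. Data invented for illustration.
import math

x = [1, 2, 4, 8, 16, 32]
y = [2.05, 2.83, 4.10, 5.60, 8.05, 11.4]   # roughly y = 2 * x^0.5

ln_x = [math.log(v) for v in x]
ln_y = [math.log(v) for v in y]

n = len(x)
xbar, ybar = sum(ln_x) / n, sum(ln_y) / n
b = sum((u - xbar) * (v - ybar) for u, v in zip(ln_x, ln_y)) / \
    sum((u - xbar) ** 2 for u in ln_x)
a = ybar - b * xbar

A = math.exp(a)   # when ln(y) = a + b*ln(x), y = A * x^b
print(f"y = {A:.2f} * x^{b:.2f}")   # close to y = 2 * x^0.5
```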
47. Transformation thoughts
• Although this is not a major topic for the course, you still need to be able to apply these two transformations (exponential and power)
• Be sure to check the residuals for the LSRL on the transformed data! You may have picked the wrong model :/
• If one model doesn’t work, try the other. I would start with the exponential model.
• Don’t transform into a cockroach. Ask Kafka!
50. Marginal Distributions
• Tables that relate two categorical variables are called “Two-Way Tables”
– Ex 4.11, pg 292
• Marginal Distribution
– A very fancy term for “row totals and column totals”
– So named because the totals appear in the margins of the table. Wow.
• Often, the percentage of the row or column total is very informative
56. Marginal Distributions: “Age Group”

Age Group     Female    Male    Total   Marg. Dist.
15-17             89      61      150      0.9%
18-24           5668    4697    10365     62.3%
25-34           1904    1589     3494     21.0%
35 or older     1660     970     2630     15.8%
Totals          9321    7317    16639     100%

Each entry of the marginal distribution is a row total divided by the
grand total; for example, 150/16639 = 0.009 = 0.9%. The marginal
distribution column adds to 100%.
60. Marginal Distributions: “Gender”

Age Group     Female    Male    Total
15-17             89      61      150
18-24           5668    4697    10365
25-34           1904    1589     3494
35 & up         1660     970     2630
Totals          9321    7317    16639
Marg. dist.      56%     44%     100%

Similarly for the columns: each column total divided by the grand total.
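Both marginal distributions can be computed directly from the cell counts. Here is a small Python sketch, added for illustration; note that the printed grand total of 16639 differs by one from the sum of these cells (an artifact of the source), but the percentages agree to one decimal place.

```python
# Marginal distributions for the age-by-gender two-way table:
# row total / grand total for age group, and column total /
# grand total for gender.
counts = {            # {age group: (female, male)}
    "15-17":       (89,   61),
    "18-24":       (5668, 4697),
    "25-34":       (1904, 1589),
    "35 or older": (1660, 970),
}

grand = sum(f + m for f, m in counts.values())

print("Marginal distribution of age group:")
for group, (f, m) in counts.items():
    print(f"  {group:12s} {100 * (f + m) / grand:5.1f}%")

f_total = sum(f for f, _ in counts.values())
m_total = sum(m for _, m in counts.values())
print(f"Marginal distribution of gender: "
      f"Female {100 * f_total / grand:.0f}%, Male {100 * m_total / grand:.0f}%")
```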
61. Describing Relationships
• Some relationships are easier to see when we look at the proportions within each group
• These distributions are called “Conditional Distributions”
• To find a conditional distribution, divide each cell by its row total (or column total).
• Let’s look at the same table and find the conditional distribution of gender, given each age group
62. Conditional Distributions: gender given age group

Age Group     Female          Male           Total
15-17             89 (59.3%)     61 (40.7%)     150 (100%)
18-24           5668 (54.7%)   4697 (45.3%)   10365 (100%)
25-34           1904 (54.5%)   1589 (45.5%)    3494 (100%)
35 or older     1660 (63.1%)    970 (36.9%)    2630 (100%)
Totals          9321 (56%)     7317 (44%)     16639 (100%)

Each percentage is the cell total divided by its row total; for
example, 89/150 = 59.3% and 61/150 = 40.7%, so each row’s
conditional distribution adds to 100%.

For an analysis of the effect of age group, compare each row’s
conditional distribution with the marginal distribution for the
columns (56% female, 44% male). They should be close, unless there
is an effect caused by the age group; here, the “35 or older” row
(63.1% female, 36.9% male) is not close to the marginal
distribution!
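The conditional distributions above reduce to one division per cell; a minimal Python sketch, added for illustration:

```python
# Conditional distribution of gender given age group:
# each cell divided by its row total.
counts = {            # {age group: (female, male)}
    "15-17":       (89,   61),
    "18-24":       (5668, 4697),
    "25-34":       (1904, 1589),
    "35 or older": (1660, 970),
}

cond = {}
for group, (f, m) in counts.items():
    row = f + m
    cond[group] = (100 * f / row, 100 * m / row)
    print(f"{group:12s} Female {cond[group][0]:5.1f}%  Male {cond[group][1]:5.1f}%")
```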
75. Conditional Distributions
• Based on the previous table, the distributions of “gender given age group” are not that different from one another.
• We can see that the “35 or older” group seems to differ slightly from the overall trend.
76. Conditional Distributions: “age group given gender”

Age Group     Female          Male           Total
15-17             89 (1.0%)      61 (0.8%)      150 (0.9%)
18-24           5668 (60.8%)   4697 (64.2%)   10365 (62.3%)
25-34           1904 (20.4%)   1589 (21.7%)    3494 (21.0%)
35 or older     1660 (17.8%)    970 (13.3%)    2630 (15.8%)
Totals          9321 (100%)    7317 (100%)    16639 (100%)

Here is the same table with the conditional distributions computed
by gender: each cell divided by its column total, so each column
adds to 100%. Is there a gender effect noticeable from this table?
80. Conditional Distribution
Conclusions from the previous chart
• Females are more likely to be in the “35 or older” group and less likely to be in the “18 to 24” group
• Males are more likely to be in the “18 to 24” group and less likely to be in the “35 or older” group
• These differences appear slight. Are they actually “significant” with respect to the overall distribution?
81. Conditional Distribution
• No single graph portrays the form of the relationship between categorical variables.
• No single numerical measure (such as correlation) summarizes the strength of the association.
82. Simpson’s Paradox
• Associations that hold for each of several groups can reverse direction when the data are combined into a single group.
• Ex 4.15, pg 299
• This phenomenon is often the result of an “unaccounted-for” lurking variable.
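To make the reversal concrete, here is a small Python illustration using the widely cited kidney-stone treatment counts (not the textbook’s Ex 4.15): treatment A has the higher success rate for both stone sizes, yet the lower rate overall, because A was applied mostly to the harder large-stone cases.

```python
# Simpson's paradox: A beats B within each group, yet B beats A
# when the groups are combined. Counts (success, trials) are the
# classic kidney-stone example, used here only as an illustration.
small = {"A": (81, 87),   "B": (234, 270)}   # small stones
large = {"A": (192, 263), "B": (55, 80)}     # large stones

per_group = {}   # success rate within each stone size
overall = {}     # success rate with the groups combined
for tx in ("A", "B"):
    s_s, s_n = small[tx]
    l_s, l_n = large[tx]
    per_group[tx] = (s_s / s_n, l_s / l_n)
    overall[tx] = (s_s + l_s) / (s_n + l_n)

for tx in ("A", "B"):
    rates = [f"{p:.1%}" for p in per_group[tx]]
    print(tx, rates, f"overall {overall[tx]:.1%}")
```

Stone size is the unaccounted-for lurking variable: because A handled most of the difficult cases, the combined comparison reverses.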
85. Different Relationships
• Suppose two variables (X and Y) have some correlation
– i.e., when X increases in value, Y increases as well
– One of the following relationships may hold.
86. Different Relationships
Causation
• In this relationship, the explanatory variable is somehow affecting the response variable.
• In most instances, we are looking to find evidence of a causal relationship
88. Different Relationships
Common Response
• In this relationship, both X and Y are correlated with a third (unknown) variable (Z).
• Ex: when Z increases, both X and Y increase.
• Unless we know about Z, it appears as though X and Y have a causal relationship.
90. Different Relationships
Confounding
• X and Y have correlation,
• An (often unknown) third variable ‘Z’ also has correlation with Y
• Is X the explanatory variable, is Z the explanatory variable, or are they both explanatory variables?
92. Causation
• The best way to establish causation is with a carefully designed experiment
– Possible ‘lurking variables’ are controlled
• Experiments cannot always be conducted
– Many times, they are costly or even unethical
• Some guidelines need to be established for cases where an observational study is the only way to measure the variables.
93. Causation: some criteria
• The association is strong
• The association is consistent (across different studies)
• Larger values of the explanatory variable are associated with stronger responses (a dose-response relationship)
• The alleged cause precedes the effect in time
• The alleged cause is plausible