2. http://publicationslist.org/junio
What is it about?
When dealing with two variables, the main interest is to know whether and how they are related
To this end, plotting one variable against the other is the most straightforward course of action: the scatter plot
Scatter Plots (xy plot)
Example: typical data include the prevalence of skin cancer as a function of the mean income for a group of individuals, or the unemployment rate as a function of the frequency of high-school dropouts
In examples like these, which are not rare, the scatter plot alone is not conclusive about the presence of a relationship
Linear regression
Given a controlled input variable x and a corresponding output response y, we are looking for a linear function f(x) = a + bx that reproduces the response y with the least amount of error; a linear regression is the function that minimizes the error in the responses for a given set of inputs
The technique must not be misunderstood as a summarization technique; it is a prediction technique
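The math behind linear regression is surprisingly simple, which is what makes it so popular (and misused, as well); its principle is to minimize (over a and b) the sum of the squared differences between the actual data and f(x) = a + bx
With a little algebra, the preferred values for a and b, for n data points (x_i, y_i), are given by:
$$b = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}, \qquad a = \frac{1}{n} \left( \sum_i y_i - b \sum_i x_i \right)$$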
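The closed-form least-squares solution for a and b can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `linear_fit` and the four sample points are hypothetical, chosen so that the data lies exactly on y = 1 + 2x.

```python
def linear_fit(xs, ys):
    """Return (a, b) minimizing the sum of (a + b*x - y)**2 over the data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = (n * sxy - sx * sy) / denom   # slope
    a = (sy - b * sx) / n             # intercept
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x
a, b = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The guard on `denom` matters in practice: when every x is the same, no slope can be determined at all.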
However, linear regression can be misleading
All four data sets of Anscombe’s quartet have the same linear regression; however, they are essentially different
• The first data set is represented correctly
• The second is not linear
• The third has a pronounced outlier that the regression does not account for
• The fourth does not have enough distinct values of x to support a linear regression (only two values: 8.0 and 19.0)
The problem is even worse: the confidence intervals of the data sets are all the same as well, so the problem is noticed only when the data is plotted
To check a linear regression, a useful exercise is to see where the next response falls on the plot: the regression is acceptable only if the response falls on, or close to, the line defined by the points already known
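The claim that the four data sets share one regression line can be checked numerically. The sketch below assumes the commonly published values of Anscombe's quartet (the numbers are not given in the slides) and uses a hypothetical helper `fit_line` that implements ordinary least squares.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n, slope


# Commonly published values of Anscombe's quartet (assumed, not from the slides)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (intercept, slope) in fits.items():
    # Every data set yields an intercept close to 3.0 and a slope close to 0.5
    print(name, round(intercept, 2), round(slope, 2))
```

The numbers alone cannot tell the four data sets apart, which is why the scatter plot remains necessary.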
Use linear regression only if:
• the data can be described by a straight line
• the data is well-behaved, that is, it has no pronounced outliers
• there are enough distinct values for the controlled variable
In any case, linear regression must be accompanied by a scatter plot so that visual verification is possible
Dealing with noisy data
When the data is noisy, it is often helpful to find a smooth curve that represents it, so that trends and structure can be more easily noticed
Two methods are frequently used: weighted splines and locally weighted regression (LOESS or LOWESS)
Both work by approximating the data in a small neighborhood (locally) by a polynomial of low order (at most cubic), with an adjustable parameter that controls the stiffness of the curve
The stiffer the curve, the smoother it appears but the less accurately it can follow the individual data points; balancing smoothness and accuracy is the challenge here
Splines
Splines are constructed from piecewise polynomial functions (typically cubic) that are joined together in a smooth fashion
A cubic interpolation polynomial is constructed for each consecutive pair of points, with the requirement that these individual polynomials have the same values, as well as the same first and second derivatives, at the points where they meet; these smoothness conditions lead to a set of linear equations for the coefficients of the polynomials, which can be solved so that the spline curve can be evaluated at any desired location
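How the smoothness conditions turn into a solvable linear system can be sketched for the interpolating case. The code below is an illustration under stated assumptions, not the slides' method: it uses "natural" boundary conditions (second derivative zero at both ends), takes the second derivatives at the joints as unknowns, and solves the resulting tridiagonal system with the Thomas algorithm. The function name `natural_cubic_spline` is hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function s(t) that interpolates the points (xs must increase)."""
    n = len(xs) - 1                      # number of cubic pieces
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and equal-length inputs")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")

    # m[i] is the second derivative at joint i; the end values stay 0 (natural).
    m = [0.0] * (n + 1)
    if n >= 2:
        # Continuity of the first derivative at each interior joint gives
        # h[i-1]*m[i-1] + 2*(h[i-1]+h[i])*m[i] + h[i]*m[i+1] = rhs
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        for k in range(1, n - 1):        # forward elimination
            factor = h[k] / diag[k - 1]
            diag[k] -= factor * h[k]
            rhs[k] -= factor * rhs[k - 1]
        m[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):   # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def s(t):
        # Locate the piece containing t (clamped to the first/last piece).
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        left, right = xs[i + 1] - t, t - xs[i]
        return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * left
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * right)

    return s


# Hypothetical three-point example
spline = natural_cubic_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(spline(0.5))
```

By construction the curve passes through every data point; a smoothing spline relaxes exactly this requirement.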
In addition to the local smoothness requirements at each joint, splines must also satisfy a global smoothness condition by minimizing the functional:
$$J[s] = \underbrace{\alpha \int \left( \frac{d^2 s}{dt^2} \right)^2 dt}_{\text{1st term}} + \underbrace{(1 - \alpha) \sum_i w_i \left( y_i - s(x_i) \right)^2}_{\text{2nd term}}$$
where s(t) is the spline curve, (x_i, y_i) are the coordinates of the data points, w_i are weight factors (one for each point), and α is a mixing factor
The 1st term controls how wiggly the spline is: many wiggles lead to large second derivatives; the 2nd term captures how accurately the spline represents the data points by measuring the squared deviation of the spline from each data point
The w_i values can be given by w_i = 1/d_i^2, where d_i measures how close the spline should pass by (x_i, y_i); that is, greater weights go to points that the spline should stay close to (previously chosen pivots, for example)
The value α mixes the importance of the 1st (α) and the 2nd (1 − α) terms, balancing smoothness and accuracy; high values avoid wiggly curves, and low values lead to more precise, though less smooth, curves; α is the main parameter in off-the-shelf plotting software
LOESS (locally weighted regression)
LOESS consists of approximating the data locally through a low-order
(typically linear) polynomial (regression), while weighting all the data points
in such a way that points close to the location of interest contribute
more strongly than do data points farther away (local weighting)
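Its linear case finds the parameters a and b that minimize the weighted least-squares expression:
$$\chi^2 = \sum_i w(x - x_i; h) \, (a + b x_i - y_i)^2$$
where a + b x_i − y_i is the deviation of the local line from the data point (x_i, y_i), h is the bandwidth, and w(x − x_i; h) = K((x − x_i)/h) is the weight function, with K usually a smooth and peaked kernel such as
$$K(u) = \begin{cases} (1 - |u|^3)^3 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$$
Notice how the weight function is sensitive to the distance between the point x and each of the points x_i
LOESS is computationally intensive, as the entire calculation must be performed for every point at which we want to obtain a smoothed value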
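The local fit can be sketched as follows. This is a minimal illustration that assumes the tricube kernel and a bandwidth parameter `h`; the names `tricube`, `loess_at`, and `loess_curve` are hypothetical, and the sample data is synthetic (points lying exactly on y = 3x + 2).

```python
def tricube(u):
    """Kernel K(u) = (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_at(x0, xs, ys, h):
    """Smoothed value at x0 from a locally weighted straight-line fit."""
    w = [tricube((x0 - x) / h) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("no data points within bandwidth h of x0")
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        # Only one distinct x carries weight: fall back to the weighted mean.
        return swy / sw
    b = (sw * swxy - swx * swy) / denom
    a = (swy - b * swx) / sw
    return a + b * x0


def loess_curve(xs, ys, h, grid):
    """Repeat the whole local fit at every grid point (hence the cost)."""
    return [loess_at(x0, xs, ys, h) for x0 in grid]


# Synthetic data on the line y = 3x + 2
xs = [float(i) for i in range(11)]
ys = [3.0 * x + 2.0 for x in xs]
print(loess_at(4.5, xs, ys, h=3.0))
```

A small `h` lets the curve follow local structure; a very large `h` gives nearly equal weights to all points, so the result approaches an ordinary linear regression.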
For example, in 1970, men in the USA were drafted based on their date of birth, following a sequence ranging from 1 to 366 determined by a lottery process
Soon, complaints were raised that the lottery was biased: men born later in the year had a greater chance of receiving a low draft number and thus of being drafted early
As can be seen, the plot of the points shows no evidence of bias or of any kind of pattern
However, if LOESS is used to represent the data as a smooth curve, it becomes evident that the data is biased
In the plot, the solid line corresponds to h=5, while the dashed line corresponds to h=100; this large value makes LOESS behave like a simple linear regression
This example demonstrates that a flexible curve can reveal more detail than a stiff curve, such as a straight line, which provides only a global view with less detail
As another example, consider the finishing times for the winners of a marathon, separated into men and women, with data from 1900 up to 1990 and prediction points beyond 2000
In this example, the stiff curves wrongly suggest that women should overtake men and keep improving at a dramatic pace
The smooth curves show that women's times tend to stabilize near the year 2000
Residuals
Residuals are what remains when you subtract the smooth curve from the actual data
They should be balanced, that is, symmetrically distributed around zero, preferably following a Gaussian distribution with mean zero
This figure shows the residuals for the marathon data (women only) for LOESS and for linear regression: LOESS shows smaller values, while the line shows bigger values and an increasing trend in the error
• It is important to analyze the residuals in order to verify the adequacy of the smooth curve
• Good residuals should straddle zero across the whole range of the data and should not show trends, such as steadily increasing or decreasing values
• Trends may reveal that the smooth curve is not adequate, or that it is adequate only for part of the data domain
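A residual check of this kind can be sketched in Python. The example is illustrative only: the data is synthetic (y = x² on 0..10), the line −15 + 10x is the least-squares line for exactly these points, and counting sign changes is just one simple, assumed way to expose a pattern that a zero mean would hide.

```python
def residuals(xs, ys, model):
    """Data minus model, point by point."""
    return [y - model(x) for x, y in zip(xs, ys)]


def sign_changes(values):
    """Number of times consecutive non-zero values switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


# Synthetic curved data and the least-squares straight line for it
xs = list(range(11))
ys = [x * x for x in xs]


def line(x):
    return -15.0 + 10.0 * x


res = residuals(xs, ys, line)
print(res)                 # positive at both ends, negative in the middle
print(sum(res))            # balanced around zero overall
print(sign_changes(res))   # very few sign changes: a systematic pattern
```

The residuals sum to zero, yet they form long runs of one sign, which signals that a straight line is not an adequate model for this data.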
Logarithmic plots
Logarithmic plots are based on the fundamental properties of the logarithm, which turn products into sums and powers into products
$$\log(xy) = \log x + \log y$$
$$\log(x^k) = k \log x$$
There are single, or semi-logarithmic, plots and double, or log-log, plots, depending on whether only one or both axes have been scaled logarithmically
For example, consider the function y = C exp(αx), where C and α are constants; its single log plot is given by log y = log C + αx, which is a line with slope α
Double logarithmic plots can reveal power-law relationships as straight lines
Example: consider the heartbeat rate of mammals whose mass ranges from a few kg to 120 tons (the whale)
[Figure: simple plot (left) and log-log plot (right)]
• In this example, the log-log plot reveals a line with slope -1/4, the signature of the underlying power law
• It means that heart_rate ∼ mass^(-1/4) (left picture), whose logarithmic plot is given by log(heart_rate) = -1/4 log(mass) + constant (right picture)
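Both straight-line signatures, the semi-log one for exponentials and the log-log one for power laws, can be checked numerically by fitting a line to the transformed data. The sketch below uses synthetic data only: the constants (2.0, 0.5, 200.0) and the helper `fit_line` are hypothetical and stand in for real measurements.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n, slope


# Exponential y = C * exp(alpha * x): take the log of y only (semi-log plot)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]          # synthetic, C = 2, alpha = 0.5
log_c, alpha = fit_line(xs, [math.log(y) for y in ys])

# Power law rate = k * mass**(-1/4): take the log of both axes (log-log plot)
masses = [1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0]
rates = [200.0 * m ** -0.25 for m in masses]        # synthetic, k = 200
log_k, exponent = fit_line([math.log(m) for m in masses],
                           [math.log(r) for r in rates])

print(alpha, math.exp(log_c))      # slope recovers alpha, intercept recovers C
print(exponent, math.exp(log_k))   # slope recovers the exponent -1/4
```

With real, noisy data the recovered slope is only an estimate, and the log-log scatter plot should still be inspected visually.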
Scaling for better visualization
Another technique to improve the power of a plot is to scale one, or both, of its axes
For example, consider a data set of the annual sunspot count from the year 1700 to the year 2000
Although one can see cyclic behavior, some important details are not evident
The same data set can be better visualized if either the horizontal axis or the vertical axis is scaled
[Figure panels: vertical-axis scale; horizontal-axis scale (sliced to fit)]
Some authors call this technique “banking” (?!)
Mass as a function of height
What about a linear model to represent such data?
The linear model fits the data reasonably well, but let’s take a closer look
What about a logarithmic plot?
• Surprisingly, the cubic function represents the data a lot better
• Actually, this is no surprise: the mass of a body is proportional to its volume, that is, to height times width times depth, or h · w · d
• Body proportions are pretty much the same for all humans: a person who is twice as tall as another will have shoulders that are twice as wide, too
• It follows that the volume of a person’s body (and hence its mass) scales as the third power of the height: mass ∼ height^3
Now back to the non-logarithmic plot and the cubic model, with final parameters obtained by trial and error
• The model looks a lot better now, but it has some limitations for very small and very large heights
• Despite that, it can reasonably be used for prediction and for understanding the data
Optimal team size
Consider a group of people scheduled to perform some task
The amount of work that this group can perform in a fixed amount of time (its “throughput”) is proportional to the number n of people on the team: ∼ n
However, the members will have to coordinate with each other; let’s assume that each member of the team needs to talk to every other member at least once a day, which gives a communication overhead of ∼ -n^2 (the minus sign indicates a loss in throughput)
There is an optimal number of people for which the realized productivity is highest; what is this number?
Consider that the problem can be modeled as:
$$P(n) = cn - dn^2$$
where n is the number of people, c is the number of minutes each person can produce per day, and d is the number of minutes of each communication event
Graphically, we can analyze the problem with three curves:
• raw throughput: cn
• communication overhead: dn^2
• productivity: P(n) = cn - dn^2
• But what is the best number?
• From the plot we see that P(n) has a local maximum
• How do we determine this maximum?
A local maximum occurs where the derivative is 0, so to find the maximum we take the derivative of P(n), set it equal to 0, and solve for n:
$$\frac{dP}{dn} = c - 2dn = 0$$
The result is n_optimal = c/(2d)
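The analytic optimum can be cross-checked against a brute-force search over integer team sizes. The values of c and d below are hypothetical (480 productive minutes per person per day, 20 minutes per communication event), and the function names are chosen only for this sketch.

```python
def productivity(n, c, d):
    """Realized productivity P(n) = c*n - d*n**2."""
    return c * n - d * n * n


def optimal_team_size(c, d):
    """Stationary point of P(n): dP/dn = c - 2*d*n = 0."""
    return c / (2 * d)


# Hypothetical values: c = 480 minutes per person per day, d = 20 minutes per event
c, d = 480.0, 20.0
n_opt = optimal_team_size(c, d)

# Brute-force check over integer team sizes
best_n = max(range(1, 50), key=lambda n: productivity(n, c, d))

print(n_opt, best_n, productivity(best_n, c, d))
```

When c/(2d) is not an integer, the best practical team size is one of the two neighboring integers, which the brute-force search picks out directly.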