http://publicationslist.org/junio
Data Analysis
Two variables: establishing relationships
Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP
What is it about?
When dealing with two variables, the main interest is to
know if and how they are interrelated
To this end, plotting one variable against the other is the
straightforward course of action → scatter plots
Scatter Plots (xy plot) - example
Scatter Plots (xy plot)
Example: typical data such as the prevalence of skin cancer as a
function of the mean income of a group of individuals, or the
unemployment rate as a function of the frequency of high-school
dropouts
In this example, which is not rare, the plot is inconclusive
about the presence of a relationship
Scatter Plots (xy plot)
Typical plots (four panels): no relationship; strong, simple relationship;
strong, not-simple relationship; multivariate relationship
Linear regression
Given a controlled input variable x and a corresponding output
response y, we are looking for a linear function y = f(x) = a + bx
that reproduces the response with the least amount of error; a
linear regression is the function that minimizes the error in
the responses for a given set of inputs
The technique must not be misunderstood as a summarization
technique; it is a prediction technique
Linear regression
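The math behind linear regression is surprisingly simple, which
makes it so popular (and so often misused); its principle is to
minimize, with respect to a and b, the sum of the squared
differences between the actual data and f(x) = a + bx
With a little algebra, the optimal values of a and b for n data
points (x_i, y_i) are given by:
$$ b = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}, \qquad a = \frac{1}{n} \left( \sum_i y_i - b \sum_i x_i \right) $$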
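As an illustration, the closed-form least-squares solution for a and b can be sketched in a few lines of plain Python. The function name fit_line and the sample data are assumptions made for this sketch; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((a + b*x - y)**2) over the data points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = (n * sxy - sx * sy) / denom   # slope
    a = (sy - b * sx) / n             # intercept
    return a, b


# Hypothetical data lying close to y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.9]
a, b = fit_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The explicit check on the denominator matters in practice: if every x value is the same, no slope can be estimated from the data.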
However, linear regression can be misleading
Linear regression
Consider these four data sets, known as Anscombe’s quartet:
Linear regression
All four data sets of Anscombe’s quartet have the same
linear regression; however, they are essentially different
• The first data set is represented correctly
• The second is not linear
• The third has a pronounced outlier, which the regression does not capture
• The fourth does not have enough distinct values of x to
support a linear regression (only two values: 8.0 and 19.0)
• The problem is even worse: the confidence intervals of the
data sets are all the same as well, so the problem is noticed
only when the data is plotted
• To verify a linear regression, a useful exercise is to check
where the next response would fall in the plot; the fit is acceptable only if
the response falls on the line defined by the points already known
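The claim that all four data sets share the same regression line can be checked numerically. The sketch below uses the commonly published values of the quartet (Anscombe, 1973), which are not reproduced in the slides; the helper fit_line is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Data sets I-III share the same x values; data set IV has only two distinct x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"{name}: y = {a:.2f} + {b:.3f} x")
```

The four fitted lines agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the data sets and a plot is needed.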
Linear regression
Use linear regression only if:
• the data can be described by a straight line
• the data is well-behaved, that is, has no pronounced outliers
• there are enough distinct values for the controlled variable
In any case, linear regression must be accompanied by a
scatter plot so that visual verification is possible
Dealing with noisy data
• When the data is noisy, it is often helpful to find a smooth curve that
represents it, so that trends and structure can be more easily noticed
Two methods are frequently used: weighted splines and
locally weighted regression (LOESS or LOWESS)
Both work by approximating the data in a small neighborhood
(locally) by a polynomial of low order (at most cubic), with
an adjustable parameter that controls the stiffness of the curve
The stiffer the curve, the smoother it appears but the less accurately
it can follow the individual data points → balancing smoothness
and accuracy is the challenge here
Splines
Splines are constructed from piecewise polynomial functions
(typically cubic) that are joined together in a smooth fashion
A cubic interpolation polynomial is constructed for each consecutive
pair of points, and these individual polynomials are required to have the
same values, as well as the same first and second derivatives, at the
points where they meet; these smoothness conditions lead to a
set of linear equations for the coefficients of the
polynomials, which can be solved, after which the spline curve can be
evaluated at any desired location
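The construction above can be sketched for the simplest case, an interpolating natural cubic spline, where the linear equations form a tridiagonal system. This is a minimal sketch under stated assumptions: it interpolates the points exactly (no smoothing or weighting), it sets the second derivative to zero at both ends, and the function name and sample data are hypothetical.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function s(x) interpolating (xs, ys) with a natural cubic spline.

    xs must be strictly increasing. On the interval starting at xs[j] the spline is
    ys[j] + b[j]*t + c[j]*t**2 + d[j]*t**3 with t = x - xs[j].
    """
    n = len(xs) - 1
    if n < 1 or len(ys) != n + 1:
        raise ValueError("need at least two points and matching lengths")
    if any(xs[i + 1] <= xs[i] for i in range(n)):
        raise ValueError("xs must be strictly increasing")

    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the linear system for the second-derivative terms c[i]
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        rhs[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward elimination of the tridiagonal system
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (rhs[i] - h[i - 1] * z[i - 1]) / diag[i]

    # Back substitution; c[0] = c[n] = 0 is the "natural" end condition
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def spline(x):
        # Locate the interval; values outside the range reuse the end pieces
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        t = x - xs[j]
        return ys[j] + b[j] * t + c[j] * t * t + d[j] * t ** 3

    return spline


# Hypothetical zig-zag data
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]
s = natural_cubic_spline(xs, ys)
print([round(s(x), 3) for x in (0.5, 1.5, 2.5, 3.5)])
```

Matching values and first and second derivatives at the interior points is what produces the tridiagonal system solved in the two loops.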
Splines
• In addition to the local smoothness requirements at each joint, splines must also satisfy a
global smoothness condition by minimizing the functional:
$$ J[s] = \alpha \int \left( \frac{d^2 s}{dt^2} \right)^2 dt + (1 - \alpha) \sum_i w_i \, \bigl( y_i - s(x_i) \bigr)^2 $$
where the integral is the 1st term and the sum is the 2nd term, s(t) is the spline curve, (x_i, y_i) are the coordinates of the data points, w_i
are weight factors (one for each point), and α is a mixing factor
• The 1st term controls how wiggly the spline is: many wiggles lead to large second
derivatives; the 2nd term captures how accurately the spline represents the data
points by measuring the squared deviation of the spline from each data point
• The w_i values can be given by w_i = 1/d_i^2, where d_i measures how close the spline should
pass by (x_i, y_i); that is, greater weights are given to points that the spline should pass close to
(previously chosen pivots, for example)
• The value α mixes the importance of the 1st (α) and the 2nd (1 − α) terms, balancing
smoothness and accuracy; high values avoid wiggly curves, and low values
lead to more precise, though less smooth, curves → it is the main parameter in off-the-shelf
plotting software
Wiggly vs. non-wiggly splines (two panels)
Wiggly: more precision, less smoothness; non-wiggly: less precision, more smoothness
LOESS (locally weighted regression)
• LOESS approximates the data locally through a low-order
(typically linear) polynomial regression, weighting the data points
in such a way that points close to the location of interest contribute
more strongly than data points farther away (local weighting)
• In the linear case, to obtain the smoothed value at location x, it finds the parameters a and b that minimize the weighted least-squares
expression:
$$ \chi^2 = \sum_i w(x - x_i; h) \, (a + b x_i - y_i)^2 $$
where a + b x_i − y_i is the deviation of the local line from the data point (x_i, y_i), h is the bandwidth, and w is the weight
function, usually a smooth and peaked kernel such as
$$ K(x) = \left( 1 - |x|^3 \right)^3 \ \text{for } |x| < 1; \quad 0 \ \text{otherwise} $$
• Notice how the weighting function is sensitive to the distance
between the point x and all the other points x_i
• LOESS is computationally intensive, as the entire calculation must be
performed for every point at which we want to obtain a smoothed value
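A local linear fit of this kind can be sketched in plain Python. The sketch assumes that the weight of point x_i is the kernel evaluated at (x − x_i)/h, with h acting as the bandwidth, and uses a tricube kernel that is zero outside |u| < 1; the function names and the sample data are hypothetical.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside |u| < 1, zero outside."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_point(x0, xs, ys, h):
    """Smoothed value at x0 from a weighted least-squares line (bandwidth h)."""
    w = [tricube((x0 - x) / h) for x in xs]
    sw = sum(w)
    if sw == 0:
        raise ValueError("no data points within the bandwidth")
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx * swx
    if denom <= 1e-12 * sw * swxx:
        # All weight sits on a single x value: fall back to the weighted mean
        return swy / sw
    b = (sw * swxy - swx * swy) / denom
    a = (swy - b * swx) / sw
    return a + b * x0


# Hypothetical noisy data around the line y = 2 + 0.5x
xs = [float(i) for i in range(20)]
ys = [2 + 0.5 * x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(xs)]

# The whole weighted fit is repeated for every location, hence the cost
smoothed = [loess_point(x, xs, ys, h=5.0) for x in xs]
print([round(v, 2) for v in smoothed[:5]])
```

A larger h spreads the weights over more points, so the result approaches an ordinary straight-line fit; a smaller h follows local structure more closely.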
LOESS (locally weighted regression)
For example, in 1970, men in the USA were drafted based on their
date of birth, with each date assigned a draft number from 1 to 366
through a lottery process
Soon, complaints were raised that the lottery was biased: men born
later in the year had a greater chance of receiving a low draft
number and thus of being drafted early
• As can be seen, the scatter plot of the points
shows no evidence of bias or of any
kind of pattern
• However, if LOESS is used to represent
the data as a smooth curve, it becomes
evident that the data is biased
In the plot, the solid line corresponds to h=5, while the
dashed line corresponds to h=100; this large value makes
LOESS behave like a simple linear regression
This example demonstrates that a locally fitted smooth curve can
reveal more detail than a stiff curve such as a
straight line, which provides only a global view with less
detail
LOESS (locally weighted regression)
Another example: consider the finishing times of the winners of a
marathon, separated by men and women, with data from 1900 to 1990
and prediction points beyond 2000
In this example, the stiff curves wrongly suggest
that women should overtake men and keep improving at a
dramatic pace
The smooth curves show that women's times
tend to stabilize near the year 2000
Residuals
Residuals are what remains when you subtract the
smooth curve from the actual data
They should be balanced, that is, symmetrically distributed
around zero, preferably following a Gaussian distribution
with mean zero
This figure shows the residuals for the marathon data (women only),
for LOESS and for linear regression
LOESS shows smaller residuals, while the straight line shows
larger residuals and an increasing trend in the error
• It is important to analyze the residuals in order to
verify the adequacy of the smooth curve
• Good residuals should straddle zero across the whole
range of the data and should not show trends, such as
increasing or decreasing values
• Trends may reveal that the smooth curve is not
adequate, or that it is adequate only for part of the data
domain
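The checks described above can be sketched as a small residual summary: the mean, the share of positive residuals, and a crude trend indicator (the least-squares slope of the residuals against their index). The data, the two candidate models, and the function names are all hypothetical and serve only as an illustration.

```python
def residuals(ys, fitted):
    """Residual = actual value minus the value of the fitted curve."""
    return [y - f for y, f in zip(ys, fitted)]


def summarize(res):
    """Return (mean, trend, share_positive) for a list of residuals.

    trend is the least-squares slope of the residuals against their index,
    a crude indicator of an increasing or decreasing pattern.
    """
    n = len(res)
    mean = sum(res) / n
    mid = (n - 1) / 2
    sxx = sum((i - mid) ** 2 for i in range(n))
    trend = sum((i - mid) * (r - mean) for i, r in enumerate(res)) / sxx
    share_positive = sum(1 for r in res if r > 0) / n
    return mean, trend, share_positive


# Hypothetical data: a quadratic signal plus a small alternating error
xs = list(range(10))
ys = [x * x + (0.5 if x % 2 == 0 else -0.5) for x in xs]

good_fit = [x * x for x in xs]   # model that follows the curvature
poor_fit = [5 * x for x in xs]   # straight line that ignores the curvature

good = summarize(residuals(ys, good_fit))
poor = summarize(residuals(ys, poor_fit))
print("good fit: mean=%.2f trend=%.2f share_positive=%.2f" % good)
print("poor fit: mean=%.2f trend=%.2f share_positive=%.2f" % poor)
```

A slope near zero does not prove the residuals are pattern-free (a U-shaped pattern also has zero slope), so a plot of the residuals remains the primary check.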
Logarithmic plots
Logarithmic plots are based on the fundamental properties of the
logarithm, which turn products into sums and powers into products
$$ \log(xy) = \log(x) + \log(y) $$
$$ \log(x^k) = k \log(x) $$
There are single, or semi-logarithmic, plots and double, or
log-log, plots, depending on whether only one or both axes have
been scaled logarithmically
For example, consider the function y = C exp(αx), where C and α
are constants; its single-log plot is given by log y = log C + αx, which
is a line with slope α
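The straight-line behavior of an exponential on a semi-logarithmic scale can be verified numerically: fitting a line to (x, log y) recovers the constant in the exponent as the slope and log C as the intercept. In this sketch the constant in the exponent is called rate, and the values of C and rate are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical exponential: y = C * exp(rate * x)
C_true, rate_true = 3.0, 0.7
xs = [0.5 * i for i in range(10)]
ys = [C_true * math.exp(rate_true * x) for x in xs]

# Taking the logarithm of y only corresponds to a single-log plot
log_ys = [math.log(y) for y in ys]
intercept, slope = fit_line(xs, log_ys)
C_est = math.exp(intercept)
print(f"slope = {slope:.4f}, C = {C_est:.4f}")
```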
Logarithmic plots
Example
In the example, three functions are plotted:
f(x) = 10^x, f(x) = x, and
f(x) = log(x)
Observe how the axes are scaled
and how the curves
turn into straight lines
Logarithmic plots
Example: here the logarithmic scale makes it possible to compare values that span
a large range
Logarithmic plots
Double logarithmic plots can reveal power-law
relationships as straight lines
Example: consider the heartbeat rate of mammals whose weight
ranges from a few kg to 120 tons (the whale)
Figure panels: simple plot (left), log-log plot (right)
• In this example, the log-log plot reveals a line with slope -1/4,
the signature of the underlying power law
• It means that heart_rate ∼ mass^(-1/4) (left picture), whose
logarithmic plot is given by log(heart_rate) = -1/4
log(mass) + constant → picture on the right
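The same idea can be sketched in code: fitting a straight line to (log mass, log heart_rate) recovers the exponent of the power law as the slope. The data below are synthetic values generated from an exact power law with a hypothetical prefactor k; they are not measurements.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data: heart_rate = k * mass**(-1/4), with a hypothetical prefactor k
k = 240.0
masses_kg = [10.0 ** e for e in range(-2, 6)]        # 0.01 kg up to 100 000 kg
heart_rates = [k * m ** -0.25 for m in masses_kg]

# Taking logarithms of both variables corresponds to a log-log plot
log_m = [math.log10(m) for m in masses_kg]
log_r = [math.log10(r) for r in heart_rates]
intercept, slope = fit_line(log_m, log_r)
print(f"exponent = {slope:.3f}, prefactor = {10 ** intercept:.1f}")
```

With real measurements the points would scatter around the line, and the fitted slope would only approximate the exponent.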
Scaling for better visualization
Another technique to improve the power of a plot is to scale one,
or both, of its axes
For example, consider a data set of the annual sunspot count from
the year 1700 to the year 2000
Although one can see a
cyclic behavior, some
important details are not
evident
Scaling for better visualization
The same data set can be better visualized if either the horizontal
axis or the vertical axis is scaled
Figure panels: vertical-axis scale; horizontal-axis scale (sliced to fit)
Some authors call this technique “banking” (?!)
Example, modeling two-variable data
Mass as a function of height
Consider a data set with two attributes: the height and the mass
of individuals
Mass as a function of height
What about a linear model to represent such data?
The linear model represents the data reasonably well, but let’s take a closer look
Mass as a function of height
What about a logarithmic plot?
• Surprisingly, the cubic function represents the data a lot better
• Actually, this is no surprise: the mass of a body is proportional to its
volume, that is, to height times width times depth, or h · w · d
• Body proportions are pretty much the same for all humans:
a person who is twice as tall as another will have shoulders that
are twice as wide, too
• It follows that the volume of a person’s body (and hence its mass)
scales as the third power of the height: mass ∼ height^3
Mass as a function of height
Now back to the non-logarithmic plot and the cubic model,
with final parameters obtained by trial and error
• The model seems a lot better now, but it has some
limitations for very small and very large heights
• Despite that, it can reasonably be used for prediction and
for understanding the data
Example, optimizing two-variable data
Throughput as a function of team size
Consider a group of people scheduled to perform some task.
The amount of work that this group can perform in a fixed
amount of time (its “throughput”) is proportional to the
number n of people on the team: ∼ n
However, the members will have to coordinate with each other.
Let’s assume that each member of the team needs to talk to
every other member at least once a day → communication
overhead: ∼ -n^2 (the minus sign indicates a loss in throughput)
There is an optimal number of people for which the realized
productivity is highest → what is this number?
Throughput as a function of team size
Consider that the problem can be modeled as:
$$ P(n) = cn - dn^2 $$
where n is the number of people, c is the number of minutes of work each
person can contribute per day, and d is the number of minutes of each
communication event
Graphically, we can
analyze the problem
with three curves:
• raw throughput: cn
• communication overhead: dn^2
• P(n) = cn - dn^2
• But what is the best number?
• From the plot we see that P(n) has a maximum
• How can this maximum be determined?
Throughput as a function of team size
At a local maximum the derivative is zero, so
to find the maximum we take the derivative of P(n), set it equal
to 0, and solve for n
The result is n_optimal = c/(2d)
Throughput as a function of team size
P’(n) = c − 2dn
c − 2dn = 0
n = c/(2d)
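The result of the derivation can be cross-checked numerically by comparing the closed-form optimum with a brute-force search over integer team sizes. The values chosen for c and d below are hypothetical, as are the function names.

```python
def productivity(n, c, d):
    """Realized productivity P(n) = c*n - d*n**2."""
    return c * n - d * n * n


def optimal_team_size(c, d):
    """Root of P'(n) = c - 2*d*n, i.e. n = c / (2*d)."""
    return c / (2 * d)


# Hypothetical values: 480 productive minutes per person per day,
# 20 minutes per communication event
c, d = 480.0, 20.0

n_star = optimal_team_size(c, d)
best_integer = max(range(1, 51), key=lambda n: productivity(n, c, d))

print(f"analytic optimum: n = {n_star}")
print(f"best integer team size in 1..50: n = {best_integer}")
print(f"P at the optimum: {productivity(best_integer, c, d)}")
```

Because P(n) is a downward-opening parabola, the point where the derivative vanishes is its only maximum; when c/(2d) is not an integer, the best team size is one of the two neighboring integers.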
References
• Philipp K. Janert, Data Analysis with Open Source Tools, O’Reilly, 2010.
• Wikipedia, http://en.wikipedia.org
• Wolfram MathWorld, http://mathworld.wolfram.com/
More Related Content

What's hot

Types Of Charts
Types Of ChartsTypes Of Charts
Types Of Chartslindy23
 
Analysing charts and graphics
Analysing charts and graphicsAnalysing charts and graphics
Analysing charts and graphicsŠkola Futura
 
Types of charts in Excel and How to use them
Types of charts in Excel and How to use themTypes of charts in Excel and How to use them
Types of charts in Excel and How to use themVijay Perepa
 
Data Analysis Section
Data Analysis SectionData Analysis Section
Data Analysis Sectionmsrichards
 
Types of graphs and charts and their uses with examples and pics
Types of graphs and charts and their uses  with examples and picsTypes of graphs and charts and their uses  with examples and pics
Types of graphs and charts and their uses with examples and picsMakati Science High School
 
TID Chapter 5 Introduction To Charts And Graph
TID Chapter 5 Introduction To Charts And GraphTID Chapter 5 Introduction To Charts And Graph
TID Chapter 5 Introduction To Charts And GraphWanBK Leo
 
Graphs and chars
Graphs and charsGraphs and chars
Graphs and charssalemhusin
 
Different types of charts
Different types of chartsDifferent types of charts
Different types of chartsZakaria Salim
 
Statistical measures box plots
Statistical measures   box plotsStatistical measures   box plots
Statistical measures box plotsjaflint718
 
Displaying data using charts and graphs
Displaying data using charts and graphsDisplaying data using charts and graphs
Displaying data using charts and graphsCharles Flynt
 
Data visualization 101_how_to_design_charts_and_graphs
Data visualization 101_how_to_design_charts_and_graphsData visualization 101_how_to_design_charts_and_graphs
Data visualization 101_how_to_design_charts_and_graphsAtner Yegorov
 
Tables, Graphs, and Charts Social Studies
Tables, Graphs, and Charts Social StudiesTables, Graphs, and Charts Social Studies
Tables, Graphs, and Charts Social StudiesLyn Gile Facebook
 
Introduction to graph
Introduction to graphIntroduction to graph
Introduction to graphRoyB
 
Interpret data for use in charts and graphs
Interpret data for use in charts and graphsInterpret data for use in charts and graphs
Interpret data for use in charts and graphsCharles Flynt
 
Statistics
StatisticsStatistics
Statisticsdiereck
 

What's hot (20)

Types Of Charts
Types Of ChartsTypes Of Charts
Types Of Charts
 
Analysing charts and graphics
Analysing charts and graphicsAnalysing charts and graphics
Analysing charts and graphics
 
Types of charts in Excel and How to use them
Types of charts in Excel and How to use themTypes of charts in Excel and How to use them
Types of charts in Excel and How to use them
 
Types of Charts
Types of ChartsTypes of Charts
Types of Charts
 
Charts And Graphs
Charts And GraphsCharts And Graphs
Charts And Graphs
 
Data Analysis Section
Data Analysis SectionData Analysis Section
Data Analysis Section
 
Types of graphs and charts and their uses with examples and pics
Types of graphs and charts and their uses  with examples and picsTypes of graphs and charts and their uses  with examples and pics
Types of graphs and charts and their uses with examples and pics
 
Types of Chart
Types of ChartTypes of Chart
Types of Chart
 
Graphing
GraphingGraphing
Graphing
 
TID Chapter 5 Introduction To Charts And Graph
TID Chapter 5 Introduction To Charts And GraphTID Chapter 5 Introduction To Charts And Graph
TID Chapter 5 Introduction To Charts And Graph
 
Graphs and chars
Graphs and charsGraphs and chars
Graphs and chars
 
Different types of charts
Different types of chartsDifferent types of charts
Different types of charts
 
Statistical measures box plots
Statistical measures   box plotsStatistical measures   box plots
Statistical measures box plots
 
Displaying data using charts and graphs
Displaying data using charts and graphsDisplaying data using charts and graphs
Displaying data using charts and graphs
 
Data visualization 101_how_to_design_charts_and_graphs
Data visualization 101_how_to_design_charts_and_graphsData visualization 101_how_to_design_charts_and_graphs
Data visualization 101_how_to_design_charts_and_graphs
 
Tables, Graphs, and Charts Social Studies
Tables, Graphs, and Charts Social StudiesTables, Graphs, and Charts Social Studies
Tables, Graphs, and Charts Social Studies
 
Introduction to graph
Introduction to graphIntroduction to graph
Introduction to graph
 
Interpret data for use in charts and graphs
Interpret data for use in charts and graphsInterpret data for use in charts and graphs
Interpret data for use in charts and graphs
 
Statistics
StatisticsStatistics
Statistics
 
LIB300 Using Visuals week 6
LIB300 Using Visuals week 6LIB300 Using Visuals week 6
LIB300 Using Visuals week 6
 

Viewers also liked

Eubank(1999)nonparametric regressionandsplinesmoothing
Eubank(1999)nonparametric regressionandsplinesmoothingEubank(1999)nonparametric regressionandsplinesmoothing
Eubank(1999)nonparametric regressionandsplinesmoothingariefunhas
 
Large sample property of the bayes factor in a spline semiparametric regressi...
Large sample property of the bayes factor in a spline semiparametric regressi...Large sample property of the bayes factor in a spline semiparametric regressi...
Large sample property of the bayes factor in a spline semiparametric regressi...Alexander Decker
 
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...SSA KPI
 
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Persontyle
 
General Additive Models in R
General Additive Models in RGeneral Additive Models in R
General Additive Models in RNoam Ross
 
Multivariate adaptive regression splines
Multivariate adaptive regression splinesMultivariate adaptive regression splines
Multivariate adaptive regression splinesEklavya Gupta
 
Introduction to MARS (1999)
Introduction to MARS (1999)Introduction to MARS (1999)
Introduction to MARS (1999)Salford Systems
 
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science -  Part XV - MARS, Logistic Regression, & Survival AnalysisData Science -  Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science - Part XV - MARS, Logistic Regression, & Survival AnalysisDerek Kane
 

Viewers also liked (14)

Spline Regressions
Spline RegressionsSpline Regressions
Spline Regressions
 
Eubank(1999)nonparametric regressionandsplinesmoothing
Eubank(1999)nonparametric regressionandsplinesmoothingEubank(1999)nonparametric regressionandsplinesmoothing
Eubank(1999)nonparametric regressionandsplinesmoothing
 
Large sample property of the bayes factor in a spline semiparametric regressi...
Large sample property of the bayes factor in a spline semiparametric regressi...Large sample property of the bayes factor in a spline semiparametric regressi...
Large sample property of the bayes factor in a spline semiparametric regressi...
 
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
On Foundations of Parameter Estimation for Generalized Partial Linear Models ...
 
1801 1805
1801 18051801 1805
1801 1805
 
Lecture5 kernel svm
Lecture5 kernel svmLecture5 kernel svm
Lecture5 kernel svm
 
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014
 
Regression
RegressionRegression
Regression
 
General Additive Models in R
General Additive Models in RGeneral Additive Models in R
General Additive Models in R
 
Multivariate adaptive regression splines
Multivariate adaptive regression splinesMultivariate adaptive regression splines
Multivariate adaptive regression splines
 
Predictions from MARS
Predictions from MARSPredictions from MARS
Predictions from MARS
 
Introduction to MARS (1999)
Introduction to MARS (1999)Introduction to MARS (1999)
Introduction to MARS (1999)
 
Introduction to mars_2009
Introduction to mars_2009Introduction to mars_2009
Introduction to mars_2009
 
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science -  Part XV - MARS, Logistic Regression, & Survival AnalysisData Science -  Part XV - MARS, Logistic Regression, & Survival Analysis
Data Science - Part XV - MARS, Logistic Regression, & Survival Analysis
 

Similar to Data analysis02 twovariables

105575916 maths-edit-new
105575916 maths-edit-new105575916 maths-edit-new
105575916 maths-edit-newhomeworkping7
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierAl Arizmendez
 
Conditional Correlation 2009
Conditional Correlation 2009Conditional Correlation 2009
Conditional Correlation 2009yamanote
 
2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regressionLong Beach City College
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusShuai Yuan
 
Linear regression
Linear regressionLinear regression
Linear regressionDepEd
 
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONLINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONruhila bhat
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11westy67968
 
Linear regression.pptx
Linear regression.pptxLinear regression.pptx
Linear regression.pptxssuserb8a904
 
Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm Sammer Qader
 
Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayCrystal Alvarez
 
How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graphTarun Gehlot
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxhyacinthshackley2629
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxnovabroom
 
Requirements.docxRequirementsFont Times New RomanI NEED .docx
Requirements.docxRequirementsFont Times New RomanI NEED .docxRequirements.docxRequirementsFont Times New RomanI NEED .docx
Requirements.docxRequirementsFont Times New RomanI NEED .docxheunice
 

Similar to Data analysis02 twovariables (20)

105575916 maths-edit-new
105575916 maths-edit-new105575916 maths-edit-new
105575916 maths-edit-new
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
 
Conditional Correlation 2009
Conditional Correlation 2009Conditional Correlation 2009
Conditional Correlation 2009
 
2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression2.4 Scatterplots, correlation, and regression
2.4 Scatterplots, correlation, and regression
 
Data analysis04 morethantwovariables
Data analysis04 morethantwovariablesData analysis04 morethantwovariables
Data analysis04 morethantwovariables
 
Data analysis03 timeasa-variable
Data analysis03 timeasa-variableData analysis03 timeasa-variable
Data analysis03 timeasa-variable
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 census
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Lesson 2 2
Lesson 2 2Lesson 2 2
Lesson 2 2
 
Data analysis00 commonprobabilitymodels
Data analysis00 commonprobabilitymodelsData analysis00 commonprobabilitymodels
Data analysis00 commonprobabilitymodels
 
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTIONLINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
LINE AND SCATTER DIAGRAM,FREQUENCY DISTRIBUTION
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11
 
Linear regression.pptx
Linear regression.pptxLinear regression.pptx
Linear regression.pptx
 
Chapter05
Chapter05Chapter05
Chapter05
 
Regression analysis algorithm
Regression analysis algorithm Regression analysis algorithm
Regression analysis algorithm
 
Evaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis EssayEvaluation Of A Correlation Analysis Essay
Evaluation Of A Correlation Analysis Essay
 
How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graph
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docx
 
Requirements.docxRequirementsFont Times New RomanI NEED .docx
Requirements.docxRequirementsFont Times New RomanI NEED .docxRequirements.docxRequirementsFont Times New RomanI NEED .docx
Requirements.docxRequirementsFont Times New RomanI NEED .docx
 



Data analysis02 twovariables

  • 1. http://publicationslist.org/junio Data Analysis Two variables: establishing relationships Prof. Dr. Jose Fernando Rodrigues Junior ICMC-USP
  • 2. What is it about? When dealing with two variables, the main interest is to know whether and how they are related. To this end, plotting one variable against the other is the straightforward course of action → scatter plots.
  • 4-5. Scatter Plots (xy plot). Example: typical data such as the prevalence of skin cancer as a function of the mean income of a group of individuals, or the unemployment rate as a function of the frequency of high-school dropouts. In this example, which is not rare, the plot is not conclusive about the presence of a relationship.
  • 6. Scatter Plots (xy plot). Typical plots (four panels): no relationship; strong, simple relationship; strong, not-simple relationship; multivariate relationship.
  • 7. Linear regression. Given a controlled input variable x and a corresponding output response y, we are looking for a linear function f(x) = a + bx that reproduces the response y with the least amount of error; a linear regression is the function that minimizes the error in the responses for a given set of inputs. The technique should be understood as a prediction technique, not as a summarization technique.
  • 8. Linear regression. The math behind linear regression is surprisingly simple, which is what makes it so popular (and so often misused); its principle is to minimize, over a and b, the sum of squared differences between the actual data and f(x) = a + bx. With a little algebra, the preferred values for a and b follow in closed form: $b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}$ and $a = \frac{\sum y_i - b\sum x_i}{n}$. However, linear regression can be misleading.
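The closed-form solution mentioned on this slide can be sketched in a few lines. The snippet below is a minimal illustration, assuming the standard least-squares result b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²) and a = (Σy − b Σx) / n; the function name `fit_line` and the sample points are hypothetical and not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((a + b*x - y)**2) over the data."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the sample points lie exactly on a line, the fit should recover its intercept and slope; with real data the same function returns the best compromise in the least-squares sense.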
  • 9. Linear regression. Consider the four data sets known as Anscombe's quartet.
  • 10-11. Linear regression. All four data sets of Anscombe's quartet have the same linear regression; however, they are essentially different:
      • The first data set is represented correctly.
      • The second is not linear.
      • The third has a pronounced outlier that the regression does not capture.
      • The fourth does not have enough distinct values of the independent variable x to support a linear regression (only two values: 8.0 and 19.0).
      • The problem is even worse: the confidence intervals of the data sets are all the same as well, so the problem is noticed only when the data is plotted.
    → To verify a linear regression, a useful exercise is to check where the next response will fall in the plot; the regression is acceptable only if the response falls on the line defined by the points already known.
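The claim that the four data sets share one regression line can be checked numerically. The sketch below uses the commonly published values of Anscombe's quartet (typed in here, not taken from the slides) and a hypothetical helper `fit_line` that applies the usual least-squares formulas.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b


# Anscombe's quartet as commonly published; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"{name}: y = {a:.2f} + {b:.2f} x")
```

All four fits come out at roughly y = 3.00 + 0.50 x, which is exactly why the numbers alone cannot tell the data sets apart and a scatter plot is needed.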
  • 12. Linear regression. Use linear regression only if:
      • the data can be described by a straight line;
      • the data is well-behaved, that is, has no pronounced outliers;
      • there are enough values for the controlled variable.
    In any case, linear regression must be accompanied by a scatter plot so that visual verification is possible.
  • 13. Dealing with noisy data. When the data is noisy, it is often helpful to find a smooth curve that represents it, so that trends and structure can be noticed more easily. Two methods are frequently used: weighted splines (splines) and locally weighted regression (LOESS or LOWESS). Both work by approximating the data in a small neighborhood (locally) by a polynomial of low order (at most cubic), with an adjustable parameter that controls the stiffness of the curve. The stiffer the curve, the smoother it appears, but the less accurately it can follow the individual data points → balancing smoothness and accuracy is the challenge here.
  • 14. Splines. Splines are constructed from piecewise polynomial functions (typically cubic) that are joined together in a smooth fashion. A cubic interpolation polynomial is found for each consecutive pair of points, and these individual polynomials are required to have the same values, as well as the same first and second derivatives, at the points where they meet. These smoothness conditions lead to a set of linear equations for the coefficients of the polynomials; once these are solved, the spline curve can be evaluated at any desired location.
  • 15. Splines. In addition to the local smoothness requirements at each joint, splines must also satisfy a global smoothness condition by minimizing the functional $J[s] = \alpha \int \left(\frac{d^2 s}{dt^2}\right)^2 dt + (1-\alpha) \sum_i w_i \,(y_i - s(x_i))^2$, where s(t) is the spline curve, (x_i, y_i) are the coordinates of the two-variable data points, w_i are weight factors (one for each point), and α is a mixing factor.
      • The 1st term controls how wiggly the spline is: many wiggles lead to large second derivatives. The 2nd term captures how accurately the spline represents the data points by measuring the squared deviation of the spline from each data point.
      • The w_i values can be given by $w_i = 1/d_i^2$, where d_i measures how close the spline should pass by (x_i, y_i); that is, greater weights go to points that the spline should stay close to (previously chosen pivots, for example).
      • The value α mixes the importance of the 1st (α) and the 2nd (1 − α) terms, balancing smoothness and accuracy; high values avoid wiggly curves, and low values lead to more precise, though less smooth, curves → it is the main parameter in off-the-shelf plotting software.
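The trade-off controlled by the mixing factor can be made concrete with a rough numerical sketch. It assumes the functional has the form J[s] = α ∫ (s'')² dt + (1 − α) Σ w_i (y_i − s(x_i))², which matches the two terms described on the slide, and it approximates the integral with second differences on a uniform grid. The function name `spline_functional` and the sample values are hypothetical; this is not a spline fitter, only an evaluation of the cost for two candidate curves.

```python
def spline_functional(xs, ys, s, alpha, weights=None):
    """Discrete approximation of J[s] for curve samples s[i] = s(xs[i]).

    Assumes xs is a uniform grid; the roughness integral is approximated
    with central second differences.
    """
    h = xs[1] - xs[0]
    if weights is None:
        weights = [1.0] * len(xs)
    roughness = h * sum(
        ((s[j - 1] - 2 * s[j] + s[j + 1]) / h ** 2) ** 2
        for j in range(1, len(s) - 1)
    )
    misfit = sum(w * (y - v) ** 2 for w, y, v in zip(weights, ys, s))
    return alpha * roughness + (1 - alpha) * misfit


# Hypothetical noisy data and two candidate curves sampled at the data points.
xs = [0, 1, 2, 3, 4]
ys = [0.0, 1.5, 1.0, 3.5, 4.0]
wiggly = list(ys)                     # passes through every point
straight = [0.0, 1.0, 2.0, 3.0, 4.0]  # perfectly smooth, less accurate

for alpha in (0.9, 0.01):
    j_wiggly = spline_functional(xs, ys, wiggly, alpha)
    j_straight = spline_functional(xs, ys, straight, alpha)
    print(alpha, round(j_wiggly, 3), round(j_straight, 3))
```

With a high α the straight candidate has the lower cost, and with a low α the wiggly one wins, which mirrors the smoothness versus accuracy balance described above.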
  • 16. Wiggly. Wiggly: more precision, less smoothness; non-wiggly: less precision, more smoothness.
  • 17. LOESS (locally weighted regression).
      • LOESS consists of approximating the data locally through a low-order (typically linear) polynomial (regression), while weighting all the data points so that points close to the location of interest contribute more strongly than points farther away (local weighting).
      • Its linear case finds, for each location x, the parameters a and b that minimize the weighted least-squares expression $\chi^2 = \sum_i w(x - x_i; h)\,(a + b x_i - y_i)^2$, where a + bx_i − y_i is the deviation of the local line from the data point (x_i, y_i), h is the bandwidth, and w is the weight function, usually a smooth and peaked kernel such as $K(u) = (1 - |u|^3)^3$ for $|u| < 1$ and 0 otherwise, with $w(x - x_i; h) = K((x - x_i)/h)$.
      • Notice how the weight function is sensitive to the distance between the point x and every other point x_i.
      • LOESS is computationally intensive, as the entire calculation must be performed for every point at which we want to obtain a smoothed value.
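The local fitting step can be sketched as follows. This is a minimal illustration, assuming a tricube kernel K(u) = (1 − |u|³)³ for |u| < 1 and a bandwidth h that scales the distance between the location of interest and each data point; the function names `tricube`, `loess_at` and `loess` and the sample data are hypothetical.

```python
def tricube(u):
    """Tricube kernel: peaked at 0 and exactly 0 for |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_at(x0, xs, ys, h):
    """Fit a weighted straight line around x0 and return its value at x0."""
    ws = [tricube((x0 - x) / h) for x in xs]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:
        raise ValueError("bandwidth h too small: need two distinct weighted points")
    b = (sw * swxy - swx * swy) / denom  # local slope
    a = (swy - b * swx) / sw             # local intercept
    return a + b * x0


def loess(xs, ys, h):
    """Smoothed value at every data location (one local fit per point)."""
    return [loess_at(x, xs, ys, h) for x in xs]


# Hypothetical data lying exactly on y = 1 + 2x: the smoother should return it unchanged.
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
smoothed = loess(xs, ys, h=3)
print(smoothed)
```

The loop in `loess` repeats the whole weighted fit for each location, which is the computational cost the slide refers to.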
  • 18-19. LOESS (locally weighted regression). For example, in 1970, men in the USA were drafted based on their date of birth, following a sequence ranging from 1 to 366 produced by a lottery process. Soon, complaints were raised that the lottery was biased: men born later in the year had a greater chance of receiving a low draft number and thus of being drafted early.
      • As can be seen, the plot of the points alone shows no evidence of bias or of any kind of pattern.
      • However, if LOESS is used to represent the data as a smooth curve, it becomes evident that the data is biased.
    In the plot, the solid line corresponds to h=5, while the dashed line corresponds to h=100; this large value makes LOESS behave like a simple linear regression. This example demonstrates that a flexible curve can reveal more detail than a stiff one, such as a straight line, which provides a global view with less detail.
  • 20. LOESS (locally weighted regression). Another example: consider the finishing times of the winners of a marathon, separated into men and women, with data from 1900 up to 1990 and predictions extending past the year 2000. In this example, the stiff curves (straight lines) wrongly suggest that women will overtake men and keep improving at a dramatic pace. The smooth LOESS curves show that women's times tend to stabilize near the year 2000.
  • 21-23. Residuals. Residuals are what remains when you subtract the smooth curve from the actual data. They should be balanced, that is, symmetrically distributed around zero, preferably according to a Gaussian distribution with mean zero. The figure shows the residuals for the marathon data (women only) for LOESS and for linear regression. The LOESS residuals are smaller, while the residuals of the straight line are larger and show an increasing trend.
      • It is important to analyze the residuals in order to verify the adequacy of the smooth curve.
      • Good residuals should straddle zero across all data points and should not present trends, such as increasing or decreasing values.
      • Trends may reveal that the smooth curve is not adequate, or that it is adequate only for part of the data domain.
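A residual check along these lines can be sketched with a deliberately wrong model. The example below fits a straight line to hypothetical data that actually follows y = x² and lists the residuals (data minus fitted value); the helper `fit_line` and the data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b


# Hypothetical curved data, modeled (inadequately) by a straight line.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]
a, b = fit_line(xs, ys)

# Residual = actual data minus the fitted curve.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals sum to zero, as they always do for a least-squares line with an intercept, yet they are positive at both ends and negative in the middle. That systematic pattern is the kind of trend that signals an inadequate model.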
  • 24. Logarithmic plots. Logarithmic plots are based on the fundamental properties of the logarithm, which turns products into sums and powers into products: $\log(xy) = \log x + \log y$ and $\log(x^k) = k \log x$. There are single (or semi-logarithmic) plots and double (or log-log) plots, depending on whether only one or both axes are scaled logarithmically. For example, consider the function y = C exp(αx), where C and α are constants; its semi-log plot is given by log y = log C + αx, which is a line with slope α.
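The semi-logarithmic relation log y = log C + αx also gives a simple way to estimate C and α: take the logarithm of y and fit a straight line. The sketch below assumes synthetic data generated from hypothetical constants C = 2 and α = 0.5, and a hypothetical helper `fit_line`.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b


# Synthetic data from y = C * exp(alpha * x); the constants are hypothetical.
C_true, alpha_true = 2.0, 0.5
xs = [0, 1, 2, 3, 4, 5]
ys = [C_true * math.exp(alpha_true * x) for x in xs]

# On a semi-log scale the data is a line: intercept log C, slope alpha.
log_c, alpha = fit_line(xs, [math.log(y) for y in ys])
C = math.exp(log_c)
print(C, alpha)
```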
  • 25. Logarithmic plots. Example. In the example, three functions are plotted: f(x) = 10^x, f(x) = x, and f(x) = log(x). Observe how the axes scale and how the curves turn into straight lines.
  • 26. Logarithmic plots. Example: here the logarithmic scale makes it possible to compare values that span a large range.
  • 27-28. Logarithmic plots. Double logarithmic plots have the ability to reveal power-law relationships as straight lines. Example: consider the heartbeat rate of mammals whose weight ranges from a few kg to 120 tons (the whale), shown as a simple plot (left) and as a log-log plot (right).
      • In this example, the log-log plot reveals a line with slope −1/4, the signature of the underlying power law.
      • It means that heart_rate ∼ mass^(−1/4) (left picture), whose logarithmic plot is given by log(heart_rate) = −1/4 log(mass) + constant (right picture).
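The slope of the line in a log-log plot is the exponent of the power law, so it can be estimated by fitting a line to the logarithms of both variables. The sketch below uses synthetic values generated from rate = k · mass^(−1/4) with a hypothetical constant k = 200; these are not the mammal measurements shown on the slide.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - b * sx) / n, b


# Synthetic power-law data (hypothetical constant k = 200, exponent -1/4).
masses = [10.0 ** k for k in range(6)]
rates = [200.0 * m ** -0.25 for m in masses]

# Fit a line to log(rate) versus log(mass); the slope is the exponent.
intercept, exponent = fit_line(
    [math.log10(m) for m in masses],
    [math.log10(r) for r in rates],
)
print(exponent, 10 ** intercept)
```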
  • 29. Scaling for better visualization. Another technique to improve the power of a plot is to scale one or both of its axes. For example, consider a data set of the annual sunspot count from the year 1700 to the year 2000. Although one can see cyclic behavior, some important details are not evident.
  • 30. Scaling for better visualization. The same data set can be visualized better if either the horizontal axis or the vertical axis is scaled. The figures show a vertical-axis scaling and a horizontal-axis scaling (sliced to fit). Some authors call this technique "banking" (?!).
  • 32. Mass as a function of height. Consider a dataset with two attributes: the height and the mass of individuals.
  • 33. Mass as a function of height. What about a linear model to represent such data? The model fits the data reasonably well, but let's take a closer look.
  • 34-35. Mass as a function of height. What about a logarithmic plot?
      • Surprisingly, the cubic function represents the data a lot better.
      • Actually, this is no surprise: the mass of a body is proportional to its volume, that is, to height times width times depth, or h · w · d.
      • Body proportions are pretty much the same for all humans: a person who is twice as tall as another will have shoulders that are twice as wide, too.
      • It follows that the volume of a person's body (and hence its mass) scales as the third power of the height: mass ∼ height^3.
  • 36-37. Mass as a function of height. Now back to the non-logarithmic plot and the cubic model, with final parameters obtained by trial and error.
      • The model looks a lot better now, but it has limitations for very small and very large heights.
      • Despite that, it can reasonably be used for prediction and for understanding the data.
  • 39. Mass as a function of height. Consider a group of people scheduled to perform some task. The amount of work that this group can perform in a fixed amount of time (its "throughput") is proportional to the number n of people on the team: ∼ n. However, the members will have to coordinate with each other. Let's assume that each member of the team needs to talk to every other member at least once a day → communication overhead: ∼ −n^2 (the minus sign marks the loss in throughput). There is an optimal number of people for which the realized productivity is highest → what is this number?
  • 40-42. Mass as a function of height. Consider that the problem can be modeled as $P(n) = cn - dn^2$, where n is the number of people, c is the number of productive minutes each person contributes per day, and d is the number of minutes of each communication event. Graphically, we can analyze the problem with three curves:
      • raw throughput: cn
      • communication overhead: dn^2
      • P(n) = cn − dn^2
    But what is the best number? From the plot we see that P(n) has a local maximum. How can this maximum be determined?
  • 43. Mass as a function of height. At a local maximum the derivative is zero, so to find the maximum we take the derivative of P(n), set it equal to 0, and solve for n. The result is n_optimal = c/(2d).
  • 44. Mass as a function of height. P'(n) = c − 2dn; setting c − 2dn = 0 gives n = c/(2d).
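The analytic result n = c/(2d) can be cross-checked by brute force. The sketch below assumes the model P(n) = cn − dn² from the previous slides and hypothetical values c = 480 minutes per person per day and d = 10 minutes per communication event; the function names are illustrative.

```python
def productivity(n, c, d):
    """Realized productivity P(n) = c*n - d*n**2."""
    return c * n - d * n * n


def optimal_team_size(c, d):
    """Zero of the derivative P'(n) = c - 2*d*n."""
    return c / (2 * d)


# Hypothetical parameters: 480 productive minutes per person per day,
# 10 minutes lost per communication event.
c, d = 480, 10
n_star = optimal_team_size(c, d)

# Brute-force check over integer team sizes from 1 to 100.
best_n = max(range(1, 101), key=lambda n: productivity(n, c, d))
print(n_star, best_n, productivity(best_n, c, d))
```

Since P(n) is a downward-opening parabola, the single point where the derivative vanishes is the global maximum, so the brute-force search and the formula agree.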
  • 45. References.
      • Philipp K. Janert, Data Analysis with Open Source Tools, O'Reilly, 2010.
      • Wikipedia, http://en.wikipedia.org
      • Wolfram MathWorld, http://mathworld.wolfram.com/