Scientific visualization of data
Research Methodology
Seppo Karrila
September 2017 (2560 Thai)
Executive summary
• This is about visualizing sets of experimental
data from the lab, usually smallish, for use in
science
– This is NOT about beautiful, impressive, artistic, and
emotionally enticing “infographics” for average
consumers
• Emphasis is on insights, creating or
corroborating hypotheses, or assessing
research questions
The purpose of visualization
• Sight is your most important sense
– One look at an image provides lots of information
very rapidly – a picture is worth a thousand words
– You are inherently good at detecting patterns
visually
• Seeing an “outlier” in a table is very difficult; in a graph it is
often easy
• The real purpose is “insights”: getting a higher-level
summary that may be useful
Fun quotes to know
• The purpose of computing is insight, not
numbers. (Richard Hamming)
• The purpose of the numbers computed is not
yet in sight. (unknown computer simulation
expert)
What kinds of insights are useful?
• Confirmatory (often “corroborating” would be a better word)
– You already have a hypothesis, or expectation of a pattern, now you
look for corroborating evidence in NEW data
• Creating new hypotheses
– Observing patterns that serve as “research questions”: will this
pattern continue or repeat, is it present in other datasets (from
another time, another experiment, or another region)
– You CAN’T CONFIRM A HYPOTHESIS FROM THE SAME DATA THAT GAVE
IT
• This is a problem, for example, in climate science, since we only have one history.
With the last 10 years of data we can at best confirm a hypothesis that was
made from similar data at least ten years ago. And for “climate” ten years is too
short anyway…
• So if I create a hypothesis from available data and publish it, and you then “confirm
it” in the same data, this is fundamentally nonsense.
• Recall that predictive models need three data sets: training, validation, and test
sets.
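A minimal Python sketch of such a three-way split, to keep the "new data" for checking separate from the data that suggested the pattern. The function name `split_three_ways` and the 60/20/20 proportions are illustrative assumptions, not something prescribed by these slides.

```python
import random

def split_three_ways(rows, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test parts.

    The fractions are assumptions for illustration; whatever is left after
    the training and validation parts becomes the test set.
    """
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data stays untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * frac_train)
    n_val = round(len(shuffled) * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_three_ways(range(100))
print(len(train), len(val), len(test))
```

The test part should be looked at only once, at the very end; otherwise it quietly turns into more training data.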
More about confirmatory evidence
• The only way to prove causation:
– You can turn factor A on/off. Every time you turn it on, B
happens soon after. This convinces you that “A on” causes B. Note
that this requires experiments with manipulation of A!
• Observational data (no manipulated variables) cannot
prove causation
– It can disprove it though. If B happened first, then it was not
caused by A. This is the foundation of “Granger causality”.
– If you are observing a game between intelligent opponents,
then B could anticipate moves by A and try to counter them
ahead of time. So the future opportunities of A can cause
anticipating reactions by B. This is not how natural phenomena
work, so Granger should be OK for us.
• In any case, a game should not be modeled by simple one-time-step
rules, but natural phenomena mostly obey such difference or
differential models. In other words, the simple concept of “causation”
is not appropriate for games between intelligent players.
So keep in mind…
• Statistical tests that claim to prove A affects B usually
prove no such thing at all. The causation comes from
your understanding of the world, and statistics helps
you convince others…
– It is a bit of black magic there
– For example, the p-value is this:
• Assuming given statistical distributions (often Gaussian normal)
and that your null hypothesis is correct, the probability that the
chosen statistic summarizing the data would come out at least as
extreme as the value you actually observed.
• It takes a lot of magic to now say: small p clearly means causation.
Statistics only shows correlations… it can’t tell if A caused B or the
other way around, or if C influenced both A and B
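To make the p-value definition concrete, here is a small sketch that estimates one by brute force. It uses a permutation test, which is a different technique from the Gaussian-based tests mentioned above: the null hypothesis is "the group labels do not matter", and the statistic is the absolute difference of group means. The function name and the two data lists are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10000, seed=0):
    """Estimate P(statistic at least as extreme as observed | null is true).

    Null hypothesis: labels are exchangeable, so shuffling them should
    produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    n_a = len(group_a)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = abs_mean_diff(pooled)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # small tolerance so float rounding does not hide exact ties
        if abs_mean_diff(pooled) >= observed - 1e-12:
            count += 1
    # add-one correction: the observed labelling counts as one permutation
    return (count + 1) / (n_perm + 1)

# Hypothetical measurements from two treatments
a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]
print(permutation_p_value(a, b))
```

A small value here only says that the observed difference would be rare if the labels were meaningless. It says nothing about why the groups differ.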
So let’s get to visualizations…
• In Excel, Home > Conditional formatting highlights cells by
value (color scales, data bars), so you can see the numbers in
a table at a glance.
• You can also play with Insert > Sparklines, which makes
tiny graphs within cells
• Easy to spot the smallest and largest values, and to get some
impression of the distribution
Random numbers
0.731033584
0.806055053
0.988103245
0.809884417
0.756027069
0.462190297
0.910670142
0.906566945
0.501780587
0.181802984
0.659130022
0.32821301
0.111329819
0.617390297
0.252291447
0.990308253
0.274208995
0.614407383
0.298483381
0.526614001
0.01251721
Scatter plots in Excel
• Illustration of Simpson’s paradox (from Wikipedia)
– Ignoring a factor can give a completely wrong trend
• Seppo’s paradox
– One single failed experiment can give a high R²
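Both effects are easy to reproduce numerically. The sketch below uses made-up numbers (not the Wikipedia data and not data from the course): two groups that each trend downward but trend upward when pooled, and ten trendless points that get a high R² once a single wild point is added.

```python
def slope_and_r2(xs, ys):
    """Least-squares slope and R^2 for a straight-line fit y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

# Simpson's paradox: within each group y decreases with x ...
group_a = ([1, 2, 3], [5, 4, 3])
group_b = ([6, 7, 8], [10, 9, 8])
slope_a, _ = slope_and_r2(*group_a)
slope_b, _ = slope_and_r2(*group_b)
# ... but ignoring the grouping factor gives an increasing trend.
slope_pooled, _ = slope_and_r2(group_a[0] + group_b[0],
                               group_a[1] + group_b[1])
print("slopes:", slope_a, slope_b, slope_pooled)

# One failed experiment: ten points with no real trend, plus one far away.
xs = list(range(1, 11))
ys = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.0]
_, r2_clean = slope_and_r2(xs, ys)
_, r2_spoiled = slope_and_r2(xs + [100], ys + [50.0])
print("R2 without / with the failed point:", r2_clean, r2_spoiled)
```

In a scatter plot both problems are visible immediately; in the summary numbers alone they are not.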
Trouble with Excel
• Even making a plot showing Simpson’s paradox is difficult:
Excel does not let you format the markers by some
factor…
• However, most people can manipulate data in Excel, do
some basic transformation, delete an outlier that would
spoil the analysis (i.e., a failed experiment)
– Statisticians can make up theories and criteria for what is an
outlier. For an experimentalist, if you trust the experiment, then
it can be an interesting special case… What counts is whether
the data is real or corrupted. So one person’s outlier can be the
important special case for another.
• Remember to keep your raw data safe. Do the analysis,
including deleting outliers, in a separate file, preferably in a
separate folder altogether!
Pivot tables and charts
• These are excellent for inspecting effects of
multiple factors, especially when each factor
only has two or three levels
• Note: often you want to “paste special”
choosing “values”, maybe also “transpose”
– Copying formulas instead of values can be trouble
– Next page has a data table, explore it in Excel…
Data for pivoting
Starch Preproc Temp C A B E F awcrit minerr maxerr Mrcrit
cassava pregel 25 0.968875 -7.19313 4.136666 11.49491 4.369757 0.594448 -0.1752 0.267812 11.20288
cassava pregel 35 0.961012 -6.59564 4.80928 12.37645 3.90624 0.560223 -0.11034 0.200768 10.83981
cassava raw 25 1.084315 -9.35922 6.394428 11.54028 6.14054 0.584587 -0.15701 0.234026 12.88684
cassava raw 35 1.053811 -8.16377 6.670096 12.89617 5.238465 0.567919 -0.19874 0.207021 12.56244
mix pregel 25 0.970572 -7.23715 4.387437 12.13923 4.860418 0.661974 -0.18178 0.335665 12.89627
mix pregel 35 0.956754 -6.33248 5.169849 13.17968 3.939175 0.628266 -0.1169 0.189642 12.21953
mix raw 25 1.011224 -6.82134 7.067797 13.1484 5.482373 0.652614 -0.07943 0.114265 14.0632
mix raw 35 1.03495 -7.5668 6.548293 13.77577 4.80145 0.647217 -0.13632 0.238465 13.71736
rice pregel 25 0.976289 -7.87222 3.483088 11.59115 4.755083 0.649166 -0.13634 0.280165 12.27966
rice pregel 35 0.963513 -7.03764 4.010848 12.02805 3.972963 0.607599 -0.2066 0.231581 11.28119
rice raw 25 1.00602 -6.96414 7.248538 13.4071 5.88229 0.672905 -0.18423 0.139877 14.904
rice raw 35 1.023475 -7.41054 6.808834 13.77478 5.185373 0.655075 -0.20526 0.113226 14.20889
Reproduce this pivot chart…
• Note that you can sort the “axis fields”, and this affects
the grouping
– You can select a primary comparison
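For readers who want to check their pivot table outside Excel, the grouping step can be sketched in plain Python. The rows below copy only the Starch, Preproc, Temp, and Mrcrit columns from the data table; the helper `pivot_mean` is a hypothetical name, and averaging over the two temperatures is just one possible choice of summary.

```python
from collections import defaultdict

# (Starch, Preproc, Temp, Mrcrit) copied from the data table
rows = [
    ("cassava", "pregel", 25, 11.20288), ("cassava", "pregel", 35, 10.83981),
    ("cassava", "raw", 25, 12.88684), ("cassava", "raw", 35, 12.56244),
    ("mix", "pregel", 25, 12.89627), ("mix", "pregel", 35, 12.21953),
    ("mix", "raw", 25, 14.0632), ("mix", "raw", 35, 13.71736),
    ("rice", "pregel", 25, 12.27966), ("rice", "pregel", 35, 11.28119),
    ("rice", "raw", 25, 14.904), ("rice", "raw", 35, 14.20889),
]

def pivot_mean(rows, row_field, col_field, value_field):
    """Average value_field for every (row_field, col_field) combination.

    Fields are given as tuple positions, e.g. 0 for Starch, 1 for Preproc.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        key = (r[row_field], r[col_field])
        sums[key] += r[value_field]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

table = pivot_mean(rows, row_field=0, col_field=1, value_field=3)
for (starch, preproc), mean_mrcrit in sorted(table.items()):
    print(f"{starch:8s} {preproc:7s} {mean_mrcrit:.3f}")
```

Swapping `row_field` and `col_field` corresponds to reordering the axis fields in the pivot chart: the numbers stay the same, but the primary comparison changes.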
How about fitting a model?
• There are very basic model options ready-made
as trendlines in Excel
• What you really have to do typically is this:
– Your inputs and the targeted model output y are in
columns
– You guess starting values for the model parameters and
calculate the model output ỹ with these
– For every data point you form the squared error (y − ỹ)²
– Sum the column of squared errors, then minimize the
sum using Data > Solver, which adjusts the model
parameters
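The same recipe can be sketched in Python. Two things here are assumptions for illustration: the model (an exponential approach to a plateau) and the optimizer. The optimizer is a simple compass search that nudges one parameter at a time, which is not the algorithm Excel's Solver uses; it only stands in for "something that adjusts the parameters to reduce the sum". The data are synthetic, generated from known parameters so the result can be checked.

```python
import math

def model(x, a, b):
    """Hypothetical two-parameter model: y rises toward the plateau a."""
    return a * (1.0 - math.exp(-b * x))

def sse(params, xs, ys):
    """Sum of squared errors between data y and model output."""
    a, b = params
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def minimize(f, start, step=0.5, tol=1e-8, max_iter=100000):
    """Compass search: try +/- step on each parameter in turn,
    and halve the step when no single move improves f."""
    best = list(start)
    best_val = f(best)
    iterations = 0
    while step > tol and iterations < max_iter:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                val = f(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
        iterations += 1
    return best, best_val

# Synthetic, noise-free "measurements" generated with a = 2.0, b = 0.5
xs = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
ys = [model(x, 2.0, 0.5) for x in xs]

# Guess starting values, then let the search adjust them
fitted, final_sse = minimize(lambda p: sse(p, xs, ys), [1.0, 1.0])
print("fitted a, b:", fitted, "SSE:", final_sse)
```

With real, noisy data the final sum of squared errors will not go to zero, and a poor starting guess can end in a poor local minimum, just as with Solver.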
Note about the basic solver
• There is a better option freely available for download,
search for DirectOptimizer (you need to install it as
add-in)
– It comes with a small manual that helps you get started
• The point
– If you need to fit Arrhenius law, or whatever other model
from physics or physical chemistry, then you pretty much
have to do “nonlinear least squares” fitting
• Even if there is a “linearizing transformation”, the error sum
also gets transformed, and the results can be poor because of this
– Much of the time you can do this fitting in Excel…
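Why the transformed error sum matters can be seen by writing both versions out for the Arrhenius law. The symbols are the usual ones and are not defined on the slides: k is the rate constant, A the pre-exponential factor, E_a the activation energy, R the gas constant, T the absolute temperature, and a hat marks the model value. The approximation in the last line assumes the relative errors are small.

```latex
% Arrhenius law and its linearized form
k = A \exp\!\left(-\frac{E_a}{R T}\right)
\quad\Longleftrightarrow\quad
\ln k = \ln A - \frac{E_a}{R}\cdot\frac{1}{T}

% Direct nonlinear least squares minimizes the errors in k itself:
S_{\text{direct}} = \sum_i \left(k_i - \hat{k}_i\right)^2

% A straight-line fit of \ln k against 1/T minimizes instead:
S_{\log} = \sum_i \left(\ln k_i - \ln \hat{k}_i\right)^2
         \approx \sum_i \left(\frac{k_i - \hat{k}_i}{\hat{k}_i}\right)^2
```

So the linearized fit works with relative errors: points with small k (low temperature) get much more weight than in the direct fit, and the two fits can give noticeably different parameters.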
Free statistics packages
• Check out JASP or JAMOVI
– The two are very similar, JASP has some special
Bayesian statistics that are unconventional
– Note again that while people think of Bayesian
probability as causation, NO statistical test actually
proves anything about causality! (Bayesian networks
are sometimes called “causal networks”, which sounds
good but is absolutely misleading. JASP doesn’t do
them though.)
• JAMOVI current version is 0.8.0.5
– It appears to get more frequent updates than JASP
Hands-on exploration of JAMOVI
• Basic functionality for
– Importing data
– Adjusting metadata on variables (type, levels)
– Inspecting basic statistics
– Plotting the correlation matrix
• Note
– You can’t get a matrix scatter plot of multiple
variables from Excel…
Iris data in JAMOVI
• It is easy to generate fancy plots of how the data are
distributed.
• However, you can’t create classifiers in JAMOVI…
Significances of correlations
Correlation Matrix
              Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length       —          -0.118      0.872 ***     0.818 ***
Sepal.Width                      —       -0.428 ***    -0.366 ***
Petal.Length                                  —         0.963 ***
Petal.Width                                                 —
Note. * p < .05, ** p < .01, *** p < .001
Note: copy/paste from JAMOVI to Word works well, but not so well to PowerPoint. I
used OneNote as an intermediate step to get this into PowerPoint… Less than perfect.
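The stars in the table can be roughly reproduced from r and the sample size alone. The sketch below assumes n = 150, the usual size of the iris data set, and replaces the Student t distribution with a normal approximation, which is reasonable for this many samples. It is not a description of how JAMOVI computes its p-values.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for 'true correlation is zero'.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and, as an approximation,
    treats t as standard normal (fine for large n, assumed here).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

def stars(p):
    """Significance stars following the note under the table."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

n = 150  # assumed number of rows in the iris data
for r in (-0.118, 0.872, 0.818, -0.428, -0.366, 0.963):
    t, p = correlation_p_value(r, n)
    print(f"r = {r:+.3f}  t = {t:+.2f}  {stars(p)}")
```

Note how even the modest correlation of -0.366 earns three stars with 150 samples: the stars measure how sure we are that the correlation is not zero, not how strong it is.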
The point of correlations?
• IF some variable is assumed causal, then the
trends of effects are important
– B increases or decreases with the manipulated
variable A
• If two independently measured variables have
a high correlation, then neither is badly
corrupted by noise
– Correlation indicates there is mutual information;
a variable that carries no information about
anything else might as well be noise
• Pretty nice matrix scatterplot from JAMOVI
A first look at DataWarrior
• Current version 4.6.1 from Openmolecules.org
• Even if you run 64-bit Windows, take the 32-bit
version – it can handle large enough data
sets
• This is a freely available professional quality
software package
– Too many features to cover… several tutorials are
available on YouTube
Iris data again
• I selected all columns in data view of JAMOVI,
copied, pasted to Excel, put back column
labels
• Then did “paste special” with headers to get
into DataWarrior
• In DataWarrior it is easy to assign marker color, size, etc., to a feature or
variable, so one plot can display multiple dimensions.
• 3-D scatterplots are easy to make and manipulate also…
What I encourage is this…
• Get yourself free software
– Then learning to use it is a safe investment, because
you are not cut off by fees or licensing
• The first thing to do with new data is to look at it.
Let the data guide you more than your own prior
assumptions.
– It is the big effects that are important, and you should be
able to see them
– Statistically significant differences almost always
emerge if you just collect enough samples – checking
significances is to a large part a ritual without much
meaning for practice
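The claim about sample size can be checked with a few lines of arithmetic. The sketch assumes a two-sample z-test with unit variance in both groups, and asks what p-value results if the observed difference of means exactly equals a tiny true effect (0.05 standard deviations, an arbitrary choice). The function name is made up for illustration.

```python
import math

def two_sample_p_value(effect_size, n_per_group):
    """Two-sided p-value of a z-test when the observed mean difference
    equals effect_size (in standard deviations) with n samples per group."""
    z = effect_size * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

tiny_effect = 0.05  # hypothetical: far too small to matter in practice
for n in (100, 1000, 10000, 100000):
    print(n, two_sample_p_value(tiny_effect, n))
```

The effect is equally unimportant in every row; only the number of samples changes, and with it the "significance".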
Conclusions
• Most people are handy with Excel and use it to collect and
manipulate data
– It has some ability for visualization, but very limited. See how far
it can take you… maybe it is enough
– It is good for transforming data by calculating new columns
• There is now free software for some basic exploratory
plotting and statistics
– JASP and JAMOVI appear convenient for a non-statistician
• For industrial-strength visualizations DataWarrior is a free
desktop application
• None of the above is for learning classifiers or for doing
nonlinear regression… but you can do basic nonlinear
regression easily in Excel, with some manual labor
– Get DirectOptimizer add-in, at no cost
