Scientific visualization of data
Research Methodology
Seppo Karrila
September 2017 (2560 Thai)
Executive summary
• This is about visualizing usually smallish sets
of experimental lab data, for use in
science
– This is NOT about beautiful, impressive, artistic, and
emotionally enticing “infographics” for average
consumers
• Emphasis is on insights: creating or
corroborating hypotheses, or assessing
research questions
The purpose of visualization
• Sight is your most important sense
– One look at an image provides lots of information
very rapidly – a picture is worth a thousand words
– You are inherently good at detecting patterns
visually
• An “outlier” is very difficult to spot in a table, but often
easy to see in a graph
• The real purpose is “insights”: getting a higher-level
summary that may be useful
Fun quotes to know
• The purpose of computing is insight, not
numbers. (Richard Hamming)
• The purpose of the numbers computed is not
yet in sight. (unknown computer simulation
expert)
What kinds of insights are useful?
• Confirmatory (often “corroborating” would be a better word)
– You already have a hypothesis, or expectation of a pattern, now you
look for corroborating evidence in NEW data
• Creating new hypotheses
– Observing patterns that serve as “research questions”: will this
pattern continue or repeat, is it present in other datasets (from
another time, another experiment, or another region)
– You CAN’T CONFIRM A HYPOTHESIS FROM THE SAME DATA THAT GAVE
IT
• This is a problem, for example, in climate science, since we only have one history.
With the last 10 years of data we can at best confirm a hypothesis that was
made from similar data at least ten years ago. And for “climate” ten years is too
short anyway…
• So if I create a hypothesis from available data and publish it, and then you go “confirm
it” in the same data, this is fundamentally nonsense.
• Recall that predictive models need three data sets: training, validation, and test
sets.
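A minimal sketch of the three-way split mentioned above. The 60/20/20 proportions, the fixed seed, and the function name `three_way_split` are illustrative assumptions, not something the slides prescribe. The point is only that the three sets share no samples.

```python
import random

def three_way_split(n_samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Return disjoint index lists for training, validation, and test sets.

    The fractions and the seed are assumed values for illustration.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    n_train = round(frac_train * n_samples)
    n_val = round(frac_val * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # whatever remains
    return train, val, test

train, val, test = three_way_split(50)
print(len(train), len(val), len(test))
```

The test indices are never looked at while the hypothesis (model) is being formed, which is the same rule as "new data for confirmation".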
More about confirmatory evidence
• The only way to prove causation:
– You can turn factor A on/off. Every time you turn it on, B happens
soon after. This convinces you that “A on” causes B. Note
that this requires experiments with manipulation of A!
• Observational data (no manipulated variables) cannot
prove causation
– It can disprove it though. If B happened first, then it was not
caused by A. This is the foundation of “Granger causality”.
– If you are observing a game between intelligent opponents,
then B could anticipate moves by A and try to counter them
ahead of time. So A's future opportunities can cause
anticipatory reactions by B. This is not how natural phenomena
work, so Granger should be OK for us.
• In any case, a game should not be modeled by simple one-time-step
rules, but natural phenomena mostly obey such difference or
differential models. In other words, the simple concept of “causation”
is not appropriate for games between intelligent players.
So keep in mind…
• Statistical tests that claim to prove A affects B usually
prove no such thing at all. The causation comes from
your understanding of the world, and statistics helps
you convince others…
– It is a bit of black magic there
– For example, the p-value is this:
• Assuming given statistical distributions (often Gaussian normal)
and that your null hypothesis is correct, the probability that the
chosen statistic summarizing the data would come out at least as
extreme as the value you actually observed.
• It takes a lot of magic to now say: small p clearly means causation.
Statistics only shows correlations… it can’t tell if A caused B or the
other way around, or if C influenced both A and B
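To make the p-value definition concrete, here is a small sketch that computes one by brute force. It uses an exact permutation test instead of the Gaussian assumption mentioned above: the null hypothesis is that the group labels are interchangeable. The two groups of measurements are made-up numbers, and `permutation_p_value` is a hypothetical helper name.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Null hypothesis: the group labels are arbitrary. The p-value is the
    fraction of all relabelings whose mean difference is at least as
    extreme as the one actually observed.
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return (total - sum_a) / n_b - sum_a / n_a

    observed = abs(mean_diff(sum(group_a)))
    n_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        n_splits += 1
        # small tolerance so rounding error does not drop exact ties
        if abs(mean_diff(sum_a)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits

# Made-up measurements for illustration only
control = [12.1, 11.8, 12.6, 12.4]
treated = [13.0, 13.4, 12.9, 13.8]
p = permutation_p_value(control, treated)
print(p)
```

Note what the number does and does not say: it measures how surprising the observed difference is if the labels meant nothing. It contains no information about which variable caused which.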
So let’s get to visualizations…
• In Excel, Home > Conditional formatting (for example, color scales or data bars)
helps you see the numbers in a table.
• You can also play with Insert > Sparklines, which
allows making tiny graphs within cells
• Easy to spot smallest and largest, get some
impression of distribution
Random numbers
0.731033584
0.806055053
0.988103245
0.809884417
0.756027069
0.462190297
0.910670142
0.906566945
0.501780587
0.181802984
0.659130022
0.32821301
0.111329819
0.617390297
0.252291447
0.990308253
0.274208995
0.614407383
0.298483381
0.526614001
0.01251721
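Outside Excel, the same idea can be sketched in a few lines: map each number to a small bar so the smallest and largest values stand out. This is only a text imitation of a sparkline, not what Excel does internally. The values are the 21 random numbers listed above.

```python
# The 21 random numbers listed above
values = [
    0.731033584, 0.806055053, 0.988103245, 0.809884417, 0.756027069,
    0.462190297, 0.910670142, 0.906566945, 0.501780587, 0.181802984,
    0.659130022, 0.32821301, 0.111329819, 0.617390297, 0.252291447,
    0.990308253, 0.274208995, 0.614407383, 0.298483381, 0.526614001,
    0.01251721,
]

BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest

def sparkline(data):
    """Return one bar character per value, scaled between min and max."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0   # avoid division by zero for constant data
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - lo) / span * top)] for v in data)

line = sparkline(values)
print(line)
print("smallest at position", values.index(min(values)) + 1)
print("largest at position", values.index(max(values)) + 1)
```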
Scatter plots in Excel
• Illustration of Simpson’s paradox (from Wikipedia)
– Ignoring a factor can give a completely wrong trend
• Seppo’s paradox
– One single failed experiment can give a high R²
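Both paradoxes can be reproduced with a few made-up numbers. The sketch below fits an ordinary least-squares line and reports slope and R². All data values are invented for illustration, and `linear_fit` is a hypothetical helper name.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Simpson's paradox: each group trends downward, the pooled data trend upward.
x1, y1 = [1, 2, 3, 4], [6, 5, 4, 3]      # group 1 (invented)
x2, y2 = [6, 7, 8, 9], [11, 10, 9, 8]    # group 2 (invented)
slope_g1, _ = linear_fit(x1, y1)
slope_g2, _ = linear_fit(x2, y2)
slope_pooled, _ = linear_fit(x1 + x2, y1 + y2)
print("group slopes:", slope_g1, slope_g2, "pooled slope:", slope_pooled)

# "Seppo's paradox": a trendless cluster plus one far-away failed experiment.
x_clean, y_clean = [1, 2, 3, 4, 5], [2.0, 1.0, 2.0, 1.0, 2.0]
_, r2_clean = linear_fit(x_clean, y_clean)
_, r2_outlier = linear_fit(x_clean + [50], y_clean + [40.0])
print("R2 without outlier:", r2_clean, "with outlier:", r2_outlier)
```

In the second case the single distant point dominates both sums of squares, so the line is essentially drawn between the cluster and the outlier. Plotting the data shows this immediately, while the R² value alone hides it.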
Trouble with Excel
• Even making a plot that shows Simpson’s paradox is difficult:
Excel does not let you format the markers by some
factor…
• However, most people can manipulate data in Excel, do
some basic transformation, delete an outlier that would
spoil the analysis (i.e., a failed experiment)
– Statisticians can make up theories and criteria for what is an
outlier. For an experimentalist, if you trust the experiment, then
it can be an interesting special case… What counts is whether
the data is real or corrupted. So one person’s outlier can be the
important special case for another.
• Remember to keep your raw data safe. Do the analysis,
including deleting outliers, in a separate file, preferably in a
separate folder altogether!
Pivot tables and charts
• These are excellent for inspecting effects of
multiple factors, especially when each factor
only has two or three levels
• Note: often you want to “paste special”
choosing “values”, maybe also “transpose”
– Copying formulas instead of values can be trouble
– Next page has a data table, explore it in Excel…
Data for pivoting
Starch Preproc Temp C A B E F awcrit minerr maxerr Mrcrit
cassava pregel 25 0.968875 -7.19313 4.136666 11.49491 4.369757 0.594448 -0.1752 0.267812 11.20288
cassava pregel 35 0.961012 -6.59564 4.80928 12.37645 3.90624 0.560223 -0.11034 0.200768 10.83981
cassava raw 25 1.084315 -9.35922 6.394428 11.54028 6.14054 0.584587 -0.15701 0.234026 12.88684
cassava raw 35 1.053811 -8.16377 6.670096 12.89617 5.238465 0.567919 -0.19874 0.207021 12.56244
mix pregel 25 0.970572 -7.23715 4.387437 12.13923 4.860418 0.661974 -0.18178 0.335665 12.89627
mix pregel 35 0.956754 -6.33248 5.169849 13.17968 3.939175 0.628266 -0.1169 0.189642 12.21953
mix raw 25 1.011224 -6.82134 7.067797 13.1484 5.482373 0.652614 -0.07943 0.114265 14.0632
mix raw 35 1.03495 -7.5668 6.548293 13.77577 4.80145 0.647217 -0.13632 0.238465 13.71736
rice pregel 25 0.976289 -7.87222 3.483088 11.59115 4.755083 0.649166 -0.13634 0.280165 12.27966
rice pregel 35 0.963513 -7.03764 4.010848 12.02805 3.972963 0.607599 -0.2066 0.231581 11.28119
rice raw 25 1.00602 -6.96414 7.248538 13.4071 5.88229 0.672905 -0.18423 0.139877 14.904
rice raw 35 1.023475 -7.41054 6.808834 13.77478 5.185373 0.655075 -0.20526 0.113226 14.20889
Reproduce this pivot chart…
• Note that you can sort the “axis fields”, and this affects the grouping
– You can select a primary comparison
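What a pivot table computes can be sketched in plain code: group the rows by two factors and average a response over the remaining factor. The rows below are the Starch, Preproc, Temp, and Mrcrit columns from the data table above. Averaging Mrcrit over temperature is just one possible choice of summary, and `pivot_mean` is a hypothetical helper, not an Excel function.

```python
from collections import defaultdict
from statistics import mean

# (Starch, Preproc, Temp, Mrcrit) taken from the data table above
rows = [
    ("cassava", "pregel", 25, 11.20288), ("cassava", "pregel", 35, 10.83981),
    ("cassava", "raw", 25, 12.88684), ("cassava", "raw", 35, 12.56244),
    ("mix", "pregel", 25, 12.89627), ("mix", "pregel", 35, 12.21953),
    ("mix", "raw", 25, 14.0632), ("mix", "raw", 35, 13.71736),
    ("rice", "pregel", 25, 12.27966), ("rice", "pregel", 35, 11.28119),
    ("rice", "raw", 25, 14.904), ("rice", "raw", 35, 14.20889),
]

def pivot_mean(records, row_key, col_key, value):
    """Group records by (row, column) field positions and average the value field."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[row_key], rec[col_key])].append(rec[value])
    return {key: mean(vals) for key, vals in cells.items()}

# Rows: Starch (field 0); columns: Preproc (field 1); values: mean Mrcrit (field 3)
table = pivot_mean(rows, row_key=0, col_key=1, value=3)
for starch in ("cassava", "mix", "rice"):
    print(starch,
          "pregel:", round(table[(starch, "pregel")], 3),
          "raw:", round(table[(starch, "raw")], 3))
```

Swapping `row_key` and `col_key` corresponds to re-sorting the axis fields: the numbers stay the same, but a different comparison becomes the primary one.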
How about fitting a model?
• There are very basic model options ready-made
as trendlines in Excel
• What you really have to do typically is this:
– Your inputs and targeted model output y are in
columns
– You guess starting values for model parameters,
calculate model output ỹ with these
– For every data point you form squared error (y − ỹ)²
– Sum the column of squared errors, then minimize the
sum by using Data > Solver, which adjusts the model
parameters
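The same recipe can be written out as code, as a rough stand-in for Data > Solver. The sketch below uses a small damped Gauss-Newton (Levenberg-Marquardt style) iteration, which is my choice for illustration and not a description of how Solver works internally. The model y = a·exp(b·x), the synthetic data, and the starting guesses are all assumptions.

```python
import math

def model(x, a, b):
    """Example model y = a * exp(b * x); replace with your own model."""
    return a * math.exp(b * x)

def sum_squared_error(xs, ys, a, b):
    """The quantity to minimize: sum over data points of (y - model)^2."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def fit(xs, ys, a, b, max_iter=200):
    """Adjust (a, b) from a starting guess to reduce the squared-error sum."""
    damping = 1e-3
    best = sum_squared_error(xs, ys, a, b)
    for _ in range(max_iter):
        # Accumulate the 2x2 normal equations from the model derivatives.
        s_aa = s_ab = s_bb = g_a = g_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            resid = y - a * e
            d_a, d_b = e, a * x * e          # d(model)/da and d(model)/db
            s_aa += d_a * d_a
            s_ab += d_a * d_b
            s_bb += d_b * d_b
            g_a += d_a * resid
            g_b += d_b * resid
        # Damping inflates the diagonal, which shortens the proposed step.
        m_aa, m_bb = s_aa * (1 + damping), s_bb * (1 + damping)
        det = m_aa * m_bb - s_ab * s_ab
        step_a = (m_bb * g_a - s_ab * g_b) / det
        step_b = (m_aa * g_b - s_ab * g_a) / det
        trial = sum_squared_error(xs, ys, a + step_a, b + step_b)
        if trial < best:                     # improvement: accept, damp less
            a, b, best = a + step_a, b + step_b, trial
            damping /= 10
        else:                                # no improvement: reject, damp more
            damping *= 10
            if damping > 1e12:               # steps are negligible, stop
                break
    return a, b, best

xs = [0, 1, 2, 3, 4, 5]
ys = [model(x, 2.0, 0.5) for x in xs]        # synthetic, noise-free data
a_fit, b_fit, sse = fit(xs, ys, a=1.0, b=0.1)
print(a_fit, b_fit, sse)
```

As with Solver, the result depends on the starting guess. A poor guess can end in a different local minimum, so it is worth plotting the fitted curve over the data.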
Note about the basic solver
• There is a better option freely available for download,
search for DirectOptimizer (you need to install it as
add-in)
– It comes with a small manual that helps you get started
• The point
– If you need to fit Arrhenius law, or whatever other model
from physics or physical chemistry, then you pretty much
have to do “nonlinear least squares” fitting
• Even if there is a “linearizing transformation”, the error sum is
also transformed, and the results can be poor because of this
– much of the time you can do this in Excel…
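Why the transformed error sum differs can be seen from a short derivation for the Arrhenius law. The notation is the usual textbook one and is not taken from the slides: k is the rate constant, A the pre-exponential factor, E_a the activation energy, R the gas constant, T the absolute temperature, and k̃ the model prediction.

```latex
% Arrhenius law and its linearized form
k = A \exp\!\left(-\frac{E_a}{R T}\right)
\quad\Longleftrightarrow\quad
\ln k = \ln A - \frac{E_a}{R}\,\frac{1}{T}

% Least squares on the transformed data minimizes log-residuals,
% which to first order are relative errors:
\sum_i \left(\ln k_i - \ln \tilde{k}_i\right)^2
\;\approx\;
\sum_i \left(\frac{k_i - \tilde{k}_i}{k_i}\right)^2
```

So a straight-line fit of ln k against 1/T effectively weights each squared error by 1/k², and the smallest measured rates get the most influence. If the measurement noise is roughly the same absolute size for all points, this weighting is wrong, and the direct nonlinear fit is the safer choice.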
Free statistics packages
• Check out JASP or JAMOVI
– The two are very similar, JASP has some special
Bayesian statistics that are unconventional
– Note again that while people think of Bayesian
probability as causation, NO statistical test actually
proves anything about causality! (Bayesian networks
are sometimes called “causal networks”, which sounds
good but is absolutely misleading. JASP doesn’t do
them though.)
• JAMOVI current version is 0.8.0.5
– It appears to get more frequent updates than JASP
Hands-on exploration of JAMOVI
• Basic functionality for
– Importing data
– Adjusting metadata on variables (type, levels)
– Inspecting basic statistics
– Plotting the correlation matrix
• Note
– You can’t get a matrix scatter plot of multiple
variables from Excel…
Iris data in JAMOVI
• It is easy to generate fancy plots of how the data are distributed.
• However, you can’t create classifiers in JAMOVI…
Significances of correlations
Correlation Matrix
               Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
Sepal.Length   —              -0.118        0.872 ***      0.818 ***
Sepal.Width                   —             -0.428 ***     -0.366 ***
Petal.Length                                —              0.963 ***
Petal.Width                                                —
Note. * p < .05, ** p < .01, *** p < .001
Note: copy/paste to Word works well, not so well from JAMOVI to PowerPoint. I
used OneNote as intermediate to get this into PowerPoint… Less than perfect.
The point of correlations?
• IF some variable is assumed causal, then the
trends of effects are important
– B increases or decreases with the manipulated
variable A
• If two independently measured variables have
a high correlation, then neither is badly
corrupted by noise
– Correlation indicates there is mutual information,
a variable that carries no information about
anything else might as well be noise
• Pretty nice matrix scatterplot from JAMOVI
A first look at DataWarrior
• Current version 4.6.1 from Openmolecules.org
• Even if you run 64 bit Windows, take the 32
bit version – it can handle large enough data
sets
• This is a freely available professional quality
software package
– Too many features to cover… several tutorials are
available on YouTube
Iris data again
• I selected all columns in data view of JAMOVI,
copied, pasted to Excel, put back column
labels
• Then did “paste special” with headers to get
into DataWarrior
• In DataWarrior it is easy to assign marker color, size, etc., to a feature or
variable, so one plot can display multiple dimensions.
• 3-D scatterplots are easy to make and manipulate too…
What I encourage is this…
• Get yourself free software
– Then learning to use it is a safe investment, because
you are not cut off by fees or licensing
• The first thing to do with new data is to look at it.
Let the data guide you more than your own prior
assumptions.
– It is big effects that are important, you should be able
to see them
– Statistically significant differences almost always
emerge if you just collect enough samples – checking
significance is to a large extent a ritual without much
meaning in practice
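The claim about sample size is easy to check with a little arithmetic. The sketch below assumes a two-group comparison with known unit variance and an observed difference exactly equal to a small true effect of 0.05 standard deviations. These are simplifying assumptions for illustration, and `two_sample_z_p_value` is a hypothetical helper name.

```python
import math

def two_sample_z_p_value(effect_size, n_per_group):
    """Two-sided p-value of a z-test for a difference of two group means.

    effect_size is the observed mean difference in standard-deviation
    units; both groups are assumed to have known unit variance.
    """
    z = abs(effect_size) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))   # P(|Z| >= z) for a standard normal

tiny_effect = 0.05   # assumed: a practically negligible difference
for n in (100, 1_000, 10_000, 100_000):
    print(n, two_sample_z_p_value(tiny_effect, n))
```

The effect stays equally tiny in every row; only the number of samples changes. The p-value shrinks anyway, which is why the size of the effect, visible in a plot, matters more than the significance label.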
Conclusions
• Most people are handy with Excel and use it to collect and
manipulate data
– It has some ability for visualization, but very limited. See how far
it can take you… maybe it is enough
– It is good for transforming data by calculating new columns
• There is now free software for some basic exploratory
plotting and statistics
– JASP and JAMOVI appear convenient for a non-statistician
• For industry-strength visualizations DataWarrior is a free
desktop application
• None of the above is for learning classifiers or for doing
nonlinear regression… but you can do basic nonlinear
regression easily in Excel, with some manual labor
– Get DirectOptimizer add-in, at no cost

L8 scientific visualization of data

  • 1. Scientific visualization of data Research Methodology Seppo Karrila September 2017 (2560 Thai)
  • 2. Executive summary • This is about visualizing usually smallish sets of experimental data from lab, for use in science – This is NOT about beautiful impressive artistic and emotionally enticing “infographics” to average consumers • Emphasis is on insights, creating or corroborating hypotheses, or assessing research questions
  • 3. The purpose of visualization • Sight is your most important sense – One look at an image provides lots of information very rapidly – a picture is worth a thousand words – You are inherently good at detecting patterns visually • Seeing an “outlier” in a table is very difficult, it is often easy to see in a graph • The real purpose is “insights”, getting higher level summary that may be useful
  • 4. Fun quotes to know • The purpose of computing is insight, not numbers. (Richard Hamming) • The purpose of the numbers computed is not yet in sight. (unknown computer simulation expert)
  • 5. What kinds of insights are useful? • Confirmatory (often “corroborating” would be a better word) – You already have a hypothesis, or expectation of a pattern, now you look for corroborating evidence in NEW data • Creating new hypotheses – Observing patterns that serve as “research questions”: will this pattern continue or repeat, is it present in other datasets (from another time, another experiment, or another region) – You CAN’T CONFIRM A HYPOTHESIS FROM THE SAME DATA THAT GAVE IT • This is a problem for example with climate science, since we only have one history. We can only possibly confirm, with last 10 years of data, a hypothesis that was made from similar data at least ten years ago. And for “climate” ten years is too short anyway… • So if I create a hypothesis from available data and publish it, then you go “confirm it” in the same data, this is foundationally absolute nonsense. • Recall that predictive models need three data sets: training, validation, and test sets.
  • 6. More about confirmatory evidence • The only way to prove causation: – You can turn factor A on/off. Every time you turn it on, soon after B happens. This convinces you that “A on” causes B. Note that this requires experiments with manipulation of A! • Observational data (no manipulated variables) cannot prove causation – It can disprove it though. If B happened first, then it was not caused by A. This is the foundation of “Granger causality”. – If you are observing a game between intelligent opponents, then B could anticipate moves by A and try to counter them ahead of time. So the future opportunities of A can cause anticipating reactions by B. This is not how natural phenomena work, so Granger should be OK for us. • In any case, a game should not be modeled by simple one-time-step rules, but natural phenomena mostly obey such difference or differential models. In other words, the simple concept of “causation” is not appropriate for games between intelligent players.
  • 7. So keep in mind… • Statistical tests that claim to prove A affects B usually prove no such thing at all. The causation comes from your understanding of the world, and statistics helps you convince others… – It is a bit of black magic there – For example, the p-value is this: • Assuming given statistical distributions (often Gaussian normal) and that your null hypothesis is correct, the probability that the chosen statistic summarizing your observed data could be more extreme than what it is. • It takes a lot of magic to now say: small p clearly means causation. Statistics only shows correlations… it can’t tell if A caused B or the other way around, or if C influenced both A and B
  • 8. So let’s get to visualizations… • In Excel, Home > Conditional formatting allows seeing the numbers in a table. • You can also play with Insert > Sparklines which allows making tiny graphs within cells • Easy to spot smallest and largest, get some impression of distribution Random numbers 0.731033584 0.806055053 0.988103245 0.809884417 0.756027069 0.462190297 0.910670142 0.906566945 0.501780587 0.181802984 0.659130022 0.32821301 0.111329819 0.617390297 0.252291447 0.990308253 0.274208995 0.614407383 0.298483381 0.526614001 0.01251721
  • 9. Scatter plots in Excel • Illustration of Simpson’s paradox (from Wikipedia) – Ignoring a factor can give completely wrong trend • Seppo’s paradox – One single failed experiment can give high R2
  • 10. Trouble with Excel • Even making a plot showing Simpson’s paradox is difficult, Excel does not allow to format the markers by some factor… • However, most people can manipulate data in Excel, do some basic transformation, delete an outlier that would spoil the analysis (i.e., a failed experiment) – Statisticians can make up theories and criteria for what is an outlier. For an experimentalist, if you trust the experiment, then it can be an interesting special case… What counts is whether the data is real or corrupted. So one persons outlier can be the important special case for another. • Remember to keep your raw data safe. Do the analysis, including deleting outliers, in a separate file, preferably in a separate folder altogether !
  • 11. Pivot tables and charts • These are excellent for inspecting effects of multiple factors, especially when each factor only has two or three levels • Note: often you want to “paste special” choosing “values”, maybe also “transpose” – Copying formulas instead of values can be trouble – Next page has a data table, explore it in Excel…
  • 12. Data for pivoting Starch Preproc Temp C A B E F awcrit minerr maxerr Mrcrit cassava pregel 25 0.968875 -7.19313 4.136666 11.49491 4.369757 0.594448 -0.1752 0.267812 11.20288 cassava pregel 35 0.961012 -6.59564 4.80928 12.37645 3.90624 0.560223 -0.11034 0.200768 10.83981 cassava raw 25 1.084315 -9.35922 6.394428 11.54028 6.14054 0.584587 -0.15701 0.234026 12.88684 cassava raw 35 1.053811 -8.16377 6.670096 12.89617 5.238465 0.567919 -0.19874 0.207021 12.56244 mix pregel 25 0.970572 -7.23715 4.387437 12.13923 4.860418 0.661974 -0.18178 0.335665 12.89627 mix pregel 35 0.956754 -6.33248 5.169849 13.17968 3.939175 0.628266 -0.1169 0.189642 12.21953 mix raw 25 1.011224 -6.82134 7.067797 13.1484 5.482373 0.652614 -0.07943 0.114265 14.0632 mix raw 35 1.03495 -7.5668 6.548293 13.77577 4.80145 0.647217 -0.13632 0.238465 13.71736 rice pregel 25 0.976289 -7.87222 3.483088 11.59115 4.755083 0.649166 -0.13634 0.280165 12.27966 rice pregel 35 0.963513 -7.03764 4.010848 12.02805 3.972963 0.607599 -0.2066 0.231581 11.28119 rice raw 25 1.00602 -6.96414 7.248538 13.4071 5.88229 0.672905 -0.18423 0.139877 14.904 rice raw 35 1.023475 -7.41054 6.808834 13.77478 5.185373 0.655075 -0.20526 0.113226 14.20889
  • 13. Reproduce this pivot chart… • Note that you can sort the “axis fields”, and this affects the grouping – You can select a primary comparison
  • 14. How about fitting a model? • There are very basic model options ready-made as trendlines in Excel • What you really have to do typically is this: – Your inputs and targeted model output y are in columns – You guess starting values for model parameters, calculate model output y~ with these – For every data point you form squared error (y - y~)^2 – Sum the column of squared errors, then minimize the sum by using Data > Solver, which adjusts the model parameters
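The recipe above (guess parameters, compute the model output, sum the squared errors, let a solver adjust the parameters) can be sketched outside Excel as well. The Python below is only an illustration under stated assumptions: the model y~ = a * exp(b * x) is a hypothetical example, the data are synthetic values generated from known parameters rather than measurements, and the simple compass search stands in for Excel's Solver, which uses a different and more capable algorithm.

```python
import math


def model(x, params):
    """Hypothetical two-parameter model: y~ = a * exp(b * x)."""
    a, b = params
    return a * math.exp(b * x)


def sum_squared_errors(params, xs, ys):
    """The 'column of squared errors', summed: sum of (y - y~)^2."""
    return sum((y - model(x, params)) ** 2 for x, y in zip(xs, ys))


def compass_search(objective, start, step=0.5, min_step=1e-9, max_sweeps=100_000):
    """Derivative-free minimizer: try +/- step on each parameter in turn,
    keep any change that lowers the objective, halve the step when stuck."""
    params = list(start)
    best = objective(params)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                value = objective(trial)
                if value < best:
                    params, best = trial, value
                    improved = True
        if not improved:
            step /= 2
            if step < min_step:
                break
    return params, best


# Synthetic, noise-free data from known parameters, so the fit can be checked.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
true_params = (2.0, 0.3)
ys = [model(x, true_params) for x in xs]

# Guessed starting values, as in the slide's recipe.
fitted, sse = compass_search(
    lambda p: sum_squared_errors(p, xs, ys), start=(1.0, 0.1)
)
print(f"a = {fitted[0]:.4f}, b = {fitted[1]:.4f}, SSE = {sse:.3e}")
```

With real, noisy data the minimum SSE will not be zero, and the result can depend on the starting guess, just as it does with Solver.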
  • 15. Note about the basic solver • There is a better option freely available for download, search for DirectOptimizer (you need to install it as add-in) – It comes with a small manual that helps you get started • The point – If you need to fit Arrhenius law, or whatever other model from physics or physical chemistry, then you pretty much have to do “nonlinear least squares” fitting • Even if there is a “linearizing transformation”, the error sum also gets transformed, and the results can be poor because of this – much of the time you can do this in Excel…
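To make the remark about the transformed error sum concrete, here is a short worked derivation for the Arrhenius case. The notation is an assumption, not taken from the slides: k_i are the measured rate constants at temperatures T_i, and \hat{k}_i is the model prediction with parameters A and E_a. The approximation in the last step holds only when the relative residuals are small.

```latex
% Assumed notation (not from the slides); requires amsmath.
% Model prediction: \hat{k}_i = A \, e^{-E_a/(R T_i)},
% so that \ln \hat{k}_i = \ln A - \frac{E_a}{R T_i}.
\begin{align*}
S_{\text{direct}}(A, E_a)
  &= \sum_i \bigl(k_i - \hat{k}_i\bigr)^2 \\
S_{\log}(A, E_a)
  &= \sum_i \bigl(\ln k_i - \ln \hat{k}_i\bigr)^2
   = \sum_i \Bigl[\ln\Bigl(1 + \frac{k_i - \hat{k}_i}{\hat{k}_i}\Bigr)\Bigr]^2
   \approx \sum_i \Bigl(\frac{k_i - \hat{k}_i}{\hat{k}_i}\Bigr)^2
\end{align*}
```

So the straight-line fit of ln k against 1/T minimizes approximately the relative errors, which weights each squared residual by 1/\hat{k}_i^2. The low-temperature points, where k is small, then dominate the fit, and the result can differ noticeably from a direct nonlinear least squares fit in the original units.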
  • 16. Free statistics packages • Check out JASP or JAMOVI – The two are very similar, JASP has some special Bayesian statistics that are unconventional – Note again that while people think of Bayesian probability as causation, NO statistical test actually proves anything about causality! (Bayesian networks are sometimes called “causal networks”, which sounds good but is absolutely misleading. JASP doesn’t do them though.) • JAMOVI current version is 0.8.0.5 – It appears to get more frequent updates than JASP
  • 17. Hands-on exploration of JAMOVI • Basic functionality for – Importing data – Adjusting metadata on variables (type, levels) – Inspecting basic statistics – Plotting the correlation matrix • Note – You can’t get a matrix scatter plot of multiple variables from Excel…
  • 18. Iris data in JAMOVI • It is easy to generate fancy plots of how the data are distributed. • However, you can’t create classifiers in JAMOVI…
  • 19. Significances of correlations
    Correlation Matrix
                  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
    Sepal.Length  —             -0.118       0.872 ***     0.818 ***
    Sepal.Width                 —            -0.428 ***    -0.366 ***
    Petal.Length                             —             0.963 ***
    Petal.Width                                            —
    Note. * p < .05, ** p < .01, *** p < .001
    Note: copy/paste to Word works well, not so well from JAMOVI to PowerPoint. I used OneNote as intermediate to get this into PowerPoint… Less than perfect.
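As a rough cross-check of the significance stars, the correlation coefficients can be converted to approximate p-values by hand. The sketch below is not JAMOVI's procedure. It assumes the standard iris data set with n = 150 rows (the slide does not state n), uses the usual statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), and approximates the t distribution with a standard normal, which is reasonable only because n is large. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r from n samples.

    Assumption: with n - 2 degrees of freedom large, the t statistic is
    treated as standard normal instead of using the exact t distribution.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def stars(p):
    """Star notation as in the table footnote."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


n_samples = 150  # assumption: the standard iris data set has 150 rows

# Coefficients copied from the correlation matrix on the slide.
correlations = {
    ("Sepal.Length", "Sepal.Width"): -0.118,
    ("Sepal.Length", "Petal.Length"): 0.872,
    ("Sepal.Length", "Petal.Width"): 0.818,
    ("Sepal.Width", "Petal.Length"): -0.428,
    ("Sepal.Width", "Petal.Width"): -0.366,
    ("Petal.Length", "Petal.Width"): 0.963,
}

for (first, second), r in correlations.items():
    p = correlation_p_value(r, n_samples)
    print(f"{first:13s} {second:13s} r = {r:6.3f}  p ~ {p:.2e} {stars(p)}")
```

Under these assumptions only the Sepal.Length and Sepal.Width pair comes out without stars, which agrees with the table.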
  • 20. The point of correlations? • IF some variable is assumed causal, then the trends of effects are important – B increases or decreases with the manipulated variable A • If two independently measured variables have a high correlation, then neither is badly corrupted by noise – Correlation indicates there is mutual information, a variable that carries no information about anything else might as well be noise
  • 22. A first look at DataWarrior • Current version 4.6.1 from Openmolecules.org • Even if you run 64 bit Windows, take the 32 bit version – it can handle large enough data sets • This is a freely available professional quality software package – Too many features to cover… several tutorials are available on YouTube
  • 23. Iris data again • I selected all columns in data view of JAMOVI, copied, pasted to Excel, put back column labels • Then did “paste special” with headers to get into DataWarrior
  • 24. • In DataWarrior it is easy to assign marker color, size, etc., to a feature or variable, so one plot can display multiple dimensions.
  • 25. • 3-D scatterplots are easy to make and manipulate also…
  • 26. What I encourage is this… • Get yourself free software – Then learning to use it is a safe investment, because you are not cut off by fees or licensing • The first thing to do with new data is to look at it. Let the data guide you more than your own prior assumptions. – It is big effects that are important, you should be able to see them – Statistically significant differences almost always emerge if you just collect enough samples – checking significances is to a large extent a ritual without much meaning for practice
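The claim that significance emerges with enough samples can be illustrated with a small calculation. This is a sketch under simplifying assumptions, not a full power analysis: two groups of equal size n with a known common standard deviation, a two-sided z-test, and an observed difference exactly equal to the true standardized difference d. Then z = d * sqrt(n / 2), so z grows with the square root of n even when the effect d is tiny. The value d = 0.05 and the function name are illustrative choices.

```python
import math
from statistics import NormalDist


def two_sample_p_value(effect_size, n_per_group):
    """Two-sided z-test p-value when the observed standardized difference
    between two equal-sized groups equals effect_size (idealized case)."""
    z = effect_size * math.sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


tiny_effect = 0.05  # illustrative: 5 % of one standard deviation

for n in (10, 100, 1_000, 10_000, 100_000):
    p = two_sample_p_value(tiny_effect, n)
    print(f"n per group = {n:7d}   p = {p:.4f}")
```

The effect is the same in every row, and only the sample size changes. Whether a difference of this size matters in practice is a separate question that the p-value does not answer.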
  • 27. Conclusions • Most people are handy with Excel and use it to collect and manipulate data – It has some ability for visualization, but very limited. See how far it can take you… maybe it is enough – It is good for transforming data by calculating new columns • There is now free software for some basic exploratory plotting and statistics – JASP and JAMOVI appear convenient for a non-statistician • For industry-strength visualizations DataWarrior is a free desktop application • None of the above is for learning classifiers or for doing nonlinear regression… but you can do basic nonlinear regression easily in Excel, with some manual labor – Get DirectOptimizer add-in, at no cost