SlideShare a Scribd company logo
1 of 22
Download to read offline
Data Science and some
(classical) challenges
Søren Højsgaard
Department of Mathematical Sciences
Aalborg University
Prelude
• Data science and statistics – somewhat synonymous ??
• Getting large datasets from different sources organised
the right way is not something statisticians are typically
good at.
• Some of the difficulties / pitfalls that has ”haunted”
statistics for 100 years still exist in modern data science
• Many of these topics are not taught to (stats) students
anymore; probably not to data scientists either…
• Just want to mention a few…
bigData versus smallData
• Q: What can we do with bigData that we can
not do with smallData?
• A: Huh – well – We can explore; seek
inspiration for further analyses; look for
patterns in data…
”You should always enjoy your empirical
findings – because you will (probably) never
see them again…”
Finding patterns in data…
More patterns…
A classic: treatment and control
• Test medicine against high blood pressure
A classic: treatment and control
• Hypotheses:
– Scientific hypothesis: There is an effect of treatment.
– Statistical hypothesis:
• Null hypothesis: H0: ”There is no effect of treatment”.
• Alternativ hypothesis: HA: ”There is an effect of treatment”
• To investigate hypothesis we conduct a study:
– Take 100 randomly selected high-blood-presure-
patients from a population.
– Allocate 50 to treatment and 50 to control (placebo).
• Notice: First we ask questions, then we collect
data to answer them…
How to proceed
• Simple approach:
– Measure blood pressure after 4 weeks
(or measure before start and after 4
weeks and calculate difference within
each patient).
– Compare group means of treatment
and control groups with t-test or
similar.
• There are many alternatives - not to
be discussed
Treatment
Control
Traditional thinking in statistics
• We have a theory about
the world (a map).
• We have data (landscape).
• We don’t use data for
”proving” that we are
right; we use data for
”proving” that we are
wrong…
”If the map does not match the landscape then
the map is wrong…”
Traditional thinking…
• To reject H0 (”No effect of treatment”) is the
strong conclusion of an experiment.
• If we see a significant difference it is
(probably) because it is for real.
• Not rejecting H0 is not the same as accepting
H0 as ”the truth”.
”Absence of evidence of an effect is not the
same as evidence of an absence of an effect…”
Sample size…
• We could well end up not rejection H0
because of a small sample size
• With bigData we are more likely to discover a
treatment effect –
if it is there,
• bigData reduces
uncertainty…
• Thumbs up!!!
Allocation mechanishm is important
• What if – accidently – most control patients were old
and most treatment patients were young?
• A treatment effect would be confounded with age: Can
not say whether difference is due to treatment or age.
• No amount of bigData helps us here - thumbs down!!!
• Need to collect data the right way – and understand
how it is collected.
• Avoid this e.g. by randomization:
– Allocation is independent of characteristics of patients.
– 50-50 chance of patient going to treatment or control
group.
Example: Is organic food healthier than
conventional food ?
• Take two fields; grow organic carrots
on one field, conventional carrots on
the other.
• Feed carrots to pigs/rats
• Register e.g. reproductive ability.
• Will bigData help us detect a
difference?
• And - what is bigData here?
• More pigs/rats?
Conventional
Organic
The point is…
• Notice: This is a plant experiment –
NOT an animal experiment.
• The relevant source of variation is the
field-to-field variation; not the animal-
to-animal variation
• Since we have one field only per
treatment we have no replications
• Increasing number of pigs/rats per
field will give a larger dataset (bigData)
• But no information about relevant
source of variation.
Conventional
Organic
A remedy…
• The field-to-field
variation is the relevant
one.
• So – need more fields
per treatment
Conventional
Organic
Conventional
Organic
Conventional
Organic
• Very common pitfall
• Need to understand
sources of variation
• Data has to be ”big” in
the right way.
Somewhat scary experience from
previous job
• Denmark has long tradition for systematic
collection of data in dairy industry.
• Big data is big issue; automatic collection of
loads of online data on individual cows.
• Dairy industry gets lots out of their databases
– but they could get more…
• Cattle people [CP]: We have this large database
and would like to use it better:
• Data analyst / statistician [DA]: Excellent idea.
• [CP]: We think that data science / machine
learning / pattern recognition / … can help us.
• [DA]: Probably; what specifically are you trying
to obtain?
• [CP]: Well, we want to improve reproduction
and reduce lameness and mastitis problems
• [DA]: Interesting; when you say ”improve
reproduction” what do you have in mind.
• [CP]: What do you mean?
• [DA]: We have to be more specific about
where we want to go…
• [CP]: But we thought that modern data science
could help us here…
• [DA]: Huh – yes and no…
Speaking of data science
Speaking of …
• R (www.r-project.org) is the statistics system
language of choice for most statisticians
• Aalborg University hosts useR!2015
http://user2015.math.aau.dk/
• 700 statisticians / data scientists meet in
Aalborg on June 30. – July 3.
• There is room for more – only hotels are
limited..
We see patterns everywhere…

More Related Content

What's hot

Clinical prediction models: development, validation and beyond
Clinical prediction models:development, validation and beyondClinical prediction models:development, validation and beyond
Clinical prediction models: development, validation and beyondMaarten van Smeden
 
The End of the Drug Development Casino?
The End of the Drug Development Casino?The End of the Drug Development Casino?
The End of the Drug Development Casino?Paul Agapow
 
Machine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgeMachine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgePaul Agapow
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreTuri, Inc.
 
What is the story with agile data keynote agile 2018 (Magennis)
What is the story with agile data keynote   agile 2018 (Magennis)What is the story with agile data keynote   agile 2018 (Magennis)
What is the story with agile data keynote agile 2018 (Magennis)Troy Magennis
 
The basics of prediction modeling
The basics of prediction modeling The basics of prediction modeling
The basics of prediction modeling Maarten van Smeden
 
vitagramma Intelligence CRM
vitagramma Intelligence CRMvitagramma Intelligence CRM
vitagramma Intelligence CRMKaushik Shitole
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit
 
Introduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIIntroduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIMaarten van Smeden
 
I love the smell of data in the morning (getting started with data science) ...
I love the smell of data in the morning (getting started with data science)  ...I love the smell of data in the morning (getting started with data science)  ...
I love the smell of data in the morning (getting started with data science) ...Troy Magennis
 
Database problems
Database problemsDatabase problems
Database problemsfelixkcp
 
Dichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianDichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianLaure Wynants
 
sience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studysience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studywolf vanpaemel
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientificRevenue
 

What's hot (16)

Clinical prediction models: development, validation and beyond
Clinical prediction models:development, validation and beyondClinical prediction models:development, validation and beyond
Clinical prediction models: development, validation and beyond
 
The End of the Drug Development Casino?
The End of the Drug Development Casino?The End of the Drug Development Casino?
The End of the Drug Development Casino?
 
Machine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledgeMachine learning, health data & the limits of knowledge
Machine learning, health data & the limits of knowledge
 
Can your Lab compete?
Can your Lab compete? Can your Lab compete?
Can your Lab compete?
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
What is the story with agile data keynote agile 2018 (Magennis)
What is the story with agile data keynote   agile 2018 (Magennis)What is the story with agile data keynote   agile 2018 (Magennis)
What is the story with agile data keynote agile 2018 (Magennis)
 
The basics of prediction modeling
The basics of prediction modeling The basics of prediction modeling
The basics of prediction modeling
 
vitagramma Intelligence CRM
vitagramma Intelligence CRMvitagramma Intelligence CRM
vitagramma Intelligence CRM
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey Stella
 
Introduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part IIIntroduction to prediction modelling - Berlin 2018 - Part II
Introduction to prediction modelling - Berlin 2018 - Part II
 
I love the smell of data in the morning (getting started with data science) ...
I love the smell of data in the morning (getting started with data science)  ...I love the smell of data in the morning (getting started with data science)  ...
I love the smell of data in the morning (getting started with data science) ...
 
O1
O1O1
O1
 
Database problems
Database problemsDatabase problems
Database problems
 
Dichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatisticianDichotomania and other challenges for the collaborating biostatistician
Dichotomania and other challenges for the collaborating biostatistician
 
sience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real studysience 2.0 : an illustration of good research practices in a real study
sience 2.0 : an illustration of good research practices in a real study
 
Scientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talkScientific Revenue USF 2016 talk
Scientific Revenue USF 2016 talk
 

Similar to Data Science Challenges Like Treatment Effects and Study Design

Statistics for IB Biology
Statistics for IB BiologyStatistics for IB Biology
Statistics for IB BiologyEran Earland
 
Online Course: Real Statistics: A Radical Approach
Online Course: Real Statistics: A Radical ApproachOnline Course: Real Statistics: A Radical Approach
Online Course: Real Statistics: A Radical ApproachAsad Zaman
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostelloData Con LA
 
Tico facts lesson_eva
Tico facts lesson_evaTico facts lesson_eva
Tico facts lesson_evaNTHU_TICO
 
Social Graphs for Better Drug Development
Social Graphs for Better Drug DevelopmentSocial Graphs for Better Drug Development
Social Graphs for Better Drug DevelopmentVaticle
 
How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...Ruth Kearney
 
Presentation1a paul carpenter
Presentation1a paul carpenterPresentation1a paul carpenter
Presentation1a paul carpenterYinglingV
 
"Using Data Science to Design Effective Precision Preventative Behavioral Med...
"Using Data Science to Design Effective Precision Preventative Behavioral Med..."Using Data Science to Design Effective Precision Preventative Behavioral Med...
"Using Data Science to Design Effective Precision Preventative Behavioral Med...Hyper Wellbeing
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drugSean Ekins
 
Stats Workshop2010
Stats Workshop2010Stats Workshop2010
Stats Workshop2010anesah
 
Crisis of confidence, p-hacking and the future of psychology
Crisis of confidence, p-hacking and the future of psychologyCrisis of confidence, p-hacking and the future of psychology
Crisis of confidence, p-hacking and the future of psychologyMatti Heino
 
ML, biomedical data & trust
ML, biomedical data & trustML, biomedical data & trust
ML, biomedical data & trustPaul Agapow
 
USTUN_ Digital Health Assembly Open Innovation Conference: Sharing Global Da...
USTUN_ Digital Health Assembly Open Innovation Conference:  Sharing Global Da...USTUN_ Digital Health Assembly Open Innovation Conference:  Sharing Global Da...
USTUN_ Digital Health Assembly Open Innovation Conference: Sharing Global Da...Bedirhan Ustun
 
In search of the lost loss function
In search of the lost loss function In search of the lost loss function
In search of the lost loss function Stephen Senn
 
Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)jasondeveau
 

Similar to Data Science Challenges Like Treatment Effects and Study Design (20)

Statistics for IB Biology
Statistics for IB BiologyStatistics for IB Biology
Statistics for IB Biology
 
intro_big_data.pptx
intro_big_data.pptxintro_big_data.pptx
intro_big_data.pptx
 
Online Course: Real Statistics: A Radical Approach
Online Course: Real Statistics: A Radical ApproachOnline Course: Real Statistics: A Radical Approach
Online Course: Real Statistics: A Radical Approach
 
Jsm big-data
Jsm big-dataJsm big-data
Jsm big-data
 
Data science and good questions eric kostello
Data science and good questions eric kostelloData science and good questions eric kostello
Data science and good questions eric kostello
 
Tico facts lesson_eva
Tico facts lesson_evaTico facts lesson_eva
Tico facts lesson_eva
 
Social Graphs for Better Drug Development
Social Graphs for Better Drug DevelopmentSocial Graphs for Better Drug Development
Social Graphs for Better Drug Development
 
How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...How to Become a Data Science Company instead of a company with Data Scientist...
How to Become a Data Science Company instead of a company with Data Scientist...
 
Presentation1a paul carpenter
Presentation1a paul carpenterPresentation1a paul carpenter
Presentation1a paul carpenter
 
What is research
What is researchWhat is research
What is research
 
Biostatistics
BiostatisticsBiostatistics
Biostatistics
 
Sti2018 jws
Sti2018 jwsSti2018 jws
Sti2018 jws
 
"Using Data Science to Design Effective Precision Preventative Behavioral Med...
"Using Data Science to Design Effective Precision Preventative Behavioral Med..."Using Data Science to Design Effective Precision Preventative Behavioral Med...
"Using Data Science to Design Effective Precision Preventative Behavioral Med...
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drug
 
Stats Workshop2010
Stats Workshop2010Stats Workshop2010
Stats Workshop2010
 
Crisis of confidence, p-hacking and the future of psychology
Crisis of confidence, p-hacking and the future of psychologyCrisis of confidence, p-hacking and the future of psychology
Crisis of confidence, p-hacking and the future of psychology
 
ML, biomedical data & trust
ML, biomedical data & trustML, biomedical data & trust
ML, biomedical data & trust
 
USTUN_ Digital Health Assembly Open Innovation Conference: Sharing Global Da...
USTUN_ Digital Health Assembly Open Innovation Conference:  Sharing Global Da...USTUN_ Digital Health Assembly Open Innovation Conference:  Sharing Global Da...
USTUN_ Digital Health Assembly Open Innovation Conference: Sharing Global Da...
 
In search of the lost loss function
In search of the lost loss function In search of the lost loss function
In search of the lost loss function
 
Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)
 

More from InfinIT - Innovationsnetværket for it

More from InfinIT - Innovationsnetværket for it (20)

Erfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermarkErfaringer med-c kurt-noermark
Erfaringer med-c kurt-noermark
 
Object orientering, test driven development og c
Object orientering, test driven development og cObject orientering, test driven development og c
Object orientering, test driven development og c
 
Embedded softwaredevelopment hcs
Embedded softwaredevelopment hcsEmbedded softwaredevelopment hcs
Embedded softwaredevelopment hcs
 
C og c++-jens lund jensen
C og c++-jens lund jensenC og c++-jens lund jensen
C og c++-jens lund jensen
 
201811xx foredrag c_cpp
201811xx foredrag c_cpp201811xx foredrag c_cpp
201811xx foredrag c_cpp
 
C som-programmeringssprog-bt
C som-programmeringssprog-btC som-programmeringssprog-bt
C som-programmeringssprog-bt
 
Infinit seminar 060918
Infinit seminar 060918Infinit seminar 060918
Infinit seminar 060918
 
DCR solutions
DCR solutionsDCR solutions
DCR solutions
 
Not your grandfathers BPM
Not your grandfathers BPMNot your grandfathers BPM
Not your grandfathers BPM
 
Kmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolutionKmd workzone - an evolutionary approach to revolution
Kmd workzone - an evolutionary approach to revolution
 
EcoKnow - oplæg
EcoKnow - oplægEcoKnow - oplæg
EcoKnow - oplæg
 
Martin Wickins Chatbots i fronten
Martin Wickins Chatbots i frontenMartin Wickins Chatbots i fronten
Martin Wickins Chatbots i fronten
 
Marie Fenger ai kundeservice
Marie Fenger ai kundeserviceMarie Fenger ai kundeservice
Marie Fenger ai kundeservice
 
Mads Kaysen SupWiz
Mads Kaysen SupWizMads Kaysen SupWiz
Mads Kaysen SupWiz
 
Leif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support CenterLeif Howalt NNIT Service Support Center
Leif Howalt NNIT Service Support Center
 
Jan Neerbek NLP og Chatbots
Jan Neerbek NLP og ChatbotsJan Neerbek NLP og Chatbots
Jan Neerbek NLP og Chatbots
 
Anders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer SupportAnders Soegaard NLP for Customer Support
Anders Soegaard NLP for Customer Support
 
Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018Stephen Alstrup infinit august 2018
Stephen Alstrup infinit august 2018
 
Innovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekterInnovation og værdiskabelse i it-projekter
Innovation og værdiskabelse i it-projekter
 
Rokoko infin it presentation
Rokoko infin it presentation Rokoko infin it presentation
Rokoko infin it presentation
 

Recently uploaded

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Data Science Challenges Like Treatment Effects and Study Design

  • 1. Data Science and some (classical) challenges Søren Højsgaard Department of Mathematical Sciences Aalborg University
  • 2. Prelude • Data science and statistics – somewhat synonymous ?? • Getting large datasets from different sources organised the right way is not something statisticians are typically good at. • Some of the difficulties / pitfalls that has ”haunted” statistics for 100 years still exist in modern data science • Many of these topics are not taught to (stats) students anymore; probably not to data scientists either… • Just want to mention a few…
  • 3. bigData versus smallData • Q: What can we do with bigData that we can not do with smallData? • A: Huh – well – We can explore; seek inspiration for further analyses; look for patterns in data… ”You should always enjoy your empirical findings – because you will (probably) never see them again…”
  • 6. A classic: treatment and control • Test medicine against high blood pressure
  • 7. A classic: treatment and control • Hypotheses: – Scientific hypothesis: There is an effect of treatment. – Statistical hypothesis: • Null hypothesis: H0: ”There is no effect of treatment”. • Alternativ hypothesis: HA: ”There is an effect of treatment” • To investigate hypothesis we conduct a study: – Take 100 randomly selected high-blood-presure- patients from a population. – Allocate 50 to treatment and 50 to control (placebo). • Notice: First we ask questions, then we collect data to answer them…
  • 8. How to proceed • Simple approach: – Measure blood pressure after 4 weeks (or measure before start and after 4 weeks and calculate difference within each patient). – Compare group means of treatment and control groups with t-test or similar. • There are many alternatives - not to be discussed Treatment Control
  • 9. Traditional thinking in statistics • We have a theory about the world (a map). • We have data (landscape). • We don’t use data for ”proving” that we are right; we use data for ”proving” that we are wrong… ”If the map does not match the landscape then the map is wrong…”
  • 10. Traditional thinking… • To reject H0 (”No effect of treatment”) is the strong conclusion of an experiment. • If we see a significant difference it is (probably) because it is for real. • Not rejecting H0 is not the same as accepting H0 as ”the truth”. ”Absence of evidence of an effect is not the same as evidence of an absence of an effect…”
  • 11. Sample size… • We could well end up not rejection H0 because of a small sample size • With bigData we are more likely to discover a treatment effect – if it is there, • bigData reduces uncertainty… • Thumbs up!!!
  • 12. Allocation mechanishm is important • What if – accidently – most control patients were old and most treatment patients were young? • A treatment effect would be confounded with age: Can not say whether difference is due to treatment or age. • No amount of bigData helps us here - thumbs down!!! • Need to collect data the right way – and understand how it is collected. • Avoid this e.g. by randomization: – Allocation is independent of characteristics of patients. – 50-50 chance of patient going to treatment or control group.
  • 13. Example: Is organic food healthier than conventional food ? • Take two fields; grow organic carrots on one field, conventional carrots on the other. • Feed carrots to pigs/rats • Register e.g. reproductive ability. • Will bigData help us detect a difference? • And - what is bigData here? • More pigs/rats? Conventional Organic
  • 14. The point is… • Notice: This is a plant experiment – NOT an animal experiment. • The relevant source of variation is the field-to-field variation; not the animal- to-animal variation • Since we have one field only per treatment we have no replications • Increasing number of pigs/rats per field will give a larger dataset (bigData) • But no information about relevant source of variation. Conventional Organic
  • 15. A remedy… • The field-to-field variation is the relevant one. • So – need more fields per treatment Conventional Organic Conventional Organic Conventional Organic • Very common pitfall • Need to understand sources of variation • Data has to be ”big” in the right way.
  • 16. Somewhat scary experience from previous job • Denmark has long tradition for systematic collection of data in dairy industry. • Big data is big issue; automatic collection of loads of online data on individual cows. • Dairy industry gets lots out of their databases – but they could get more…
  • 17. • Cattle people [CP]: We have this large database and would like to use it better: • Data analyst / statistician [DA]: Excellent idea. • [CP]: We think that data science / machine learning / pattern recognition / … can help us. • [DA]: Probably; what specifically are you trying to obtain? • [CP]: Well, we want to improve reproduction and reduce lameness and mastitis problems
  • 18. • [DA]: Interesting; when you say ”improve reproduction” what do you have in mind. • [CP]: What do you mean? • [DA]: We have to be more specific about where we want to go… • [CP]: But we thought that modern data science could help us here… • [DA]: Huh – yes and no…
  • 19.
  • 20. Speaking of data science
  • 21. Speaking of … • R (www.r-project.org) is the statistics system language of choice for most statisticians • Aalborg University hosts useR!2015 http://user2015.math.aau.dk/ • 700 statisticians / data scientists meet in Aalborg on June 30. – July 3. • There is room for more – only hotels are limited..
  • 22. We see patterns everywhere…