This document discusses the opportunities and challenges of big data and data science over the next decade. It outlines three key points:
1. Big data is opening doors to accelerating scientific discovery through generating hypotheses from data and using ensemble models to gain multiple perspectives. However, challenges around efficacy and efficiency remain.
2. Data science can be viewed as applying the scientific method to data through discovering correlations from data-driven models and seeking causation through empirical verification, similar to traditional scientific discovery.
3. For data science to fulfill its potential, its laws and best practices around ensuring meaningful correlations and determining causation through verification must be followed, although they are not always common in practice currently. The limits of data science also
5. What is Big Data?
• Defining Big Data constrains this emerging phenomenon
• Since Big Data is not
— About data, but a problem-solving ecosystem
— A discipline, but a multidisciplinary sub-domain of most disciplines*
• What matters is what we will do with Big Data
• Big Data is opening the door to profound change in
— Processing
— Thinking
• Let’s use the potential of profound change to understand Big Data
* “transformative … changing academia (… emerged .. on the critical path for their sub-discipline)” and “is changing society”. Michael Jordan.
6. Starting to Understand Big Data
• Listen to Data
— Hypothesis generation → overcome limits of human cognition*
• Multiple, Simultaneous Perspectives
— Ensemble models → Accelerating Scientific Discovery*
• And many more …
* Necessary condition: human-guidance
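The "multiple, simultaneous perspectives" idea can be illustrated with a small bagged ensemble. This is a minimal sketch under stated assumptions, not the method used in the talk: the data are synthetic, and the names `fit_line`, `bagged_ensemble`, and `predict` are hypothetical. Each model is fit to a different bootstrap resample, and the spread of the predictions shows where the perspectives disagree, which is where human guidance matters most.

```python
import random
import statistics


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bagged_ensemble(points, n_models=25, seed=0):
    """Fit one line per bootstrap resample of the data (assumed setup)."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) > 1:  # a line needs two distinct x values
            models.append(fit_line(sample))
    return models


def predict(models, x):
    """Return the ensemble mean and the disagreement (std dev) at x."""
    preds = [slope * x + intercept for slope, intercept in models]
    return statistics.fmean(preds), statistics.stdev(preds)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(20)]
models = bagged_ensemble(data)

# Inside the observed range the models agree; far outside it they diverge.
for x in (10, 100):
    mean, spread = predict(models, x)
    print(f"x={x}: prediction {mean:.1f} +/- {spread:.2f}")
```

The spread is a cheap signal of how far the ensemble can be trusted at a given point; a single model would hide that disagreement.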
7. Big Data is in its infancy
With at least decade-long challenges
8. Outline
• Big Picture: Why and What
• Grand Opportunities
• Grand Challenges
— Efficacy, amongst many
• Laws and Limits of Data Science
10. Big Picture: Why & What
[Diagram: a Phenomenon studied through Experiment and Model]
• What (Big Data): Correlation, what might occur
• Why (Empiricism): Causation, why it occurs
11. Why: Scientific Method and the Search for Causation
History of Science and the Scientific Method
Mature Disciplines: Empiricism, Clinical Studies, Drug Discovery
The Holy Grail of science is to identify accurate causality.
Empirical, clinical trial, and drug discovery methods take time: 100+ years
Three Ages of Medicine [The Remedy: Goetz]
Free-for-All: 1850s–1940s
Rise of Trials: 1940s–2010s
Beyond the Lab: Post-2010
12. What: Models and the Search for Meaningful Correlations
• History of Modelling: mathematics, sciences, computing, …
• Disciplines
— Mature (theory-driven): math, physics, statistics, …
— Emerging (data-driven): data mining, machine learning, neural networks, support vector machines, …
The Holy Grail of data-intensive discovery is correlations that are accurate, reliable, and meaningful.
Correlation does not imply causation
• Methodologies
— Mature: 100s of years
— Emerging: at least a decade
14. Accelerating Scientific Discovery
[Diagram: Hypotheses, Experiment, Model, Correlations]
• Theory Driven: Why (Causation)
• Data Driven: What (Correlation)
15. Accelerating Scientific Discovery
[Diagram: Hypotheses, Experiment, Model, Correlations; annotated “Baylor Watson”, “Scientists”, and “Wonderful Use Case”]
• Theory Driven: Why (Causation)
• Data Driven: What (Correlation)
16. Grand Challenges
• Big Data is in its infancy: 10+ year evolution
— Efficiency: expression/language → execution (stack)
— Open Data: data use/reuse/sharing
— Efficacy
“major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.” Michael Jordan
17. “wrt to Big Data we’re now at the ‘what are the principles?’ point in time”. Michael Jordan
18. What is Data Science @ Scale?
Data Science @ scale is to data-intensive discovery as
The Scientific Method is to scientific discovery
Reframe Empiricism*
— Data Science is the data component of the Scientific Method for data
— Concepts, tools, and techniques for data-intensive discovery
  • Data-intensive discovery = virtual experiment
— Laws and Limits of Data Science
* With Dr. Jennie Duggan, MIT & Northwestern University
19. First Law of Data Science
Meaning of a correlation requires empirical verification
What is seldom enough
Why is not always necessary
Best Practice #1: Efficacy-driven data discovery
(Efficacy before efficiency)
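Why a correlation needs verification can be seen in a small simulation. The sketch below is illustrative only: every feature is pure noise, the function name `discover_then_verify` and all sizes are hypothetical, and a held-out sample stands in for the empirical verification the law calls for. Scanning many candidates on a small sample reliably "discovers" a strong correlation that fades once it is re-measured on data it was not selected on.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def discover_then_verify(n_discovery=20, n_verify=200, n_features=500, seed=0):
    """Pick the feature most correlated with the target, then re-test it.

    Target and features are independent noise, so any correlation
    found during discovery is spurious by construction.
    """
    rng = random.Random(seed)
    n_rows = n_discovery + n_verify
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

    # Discovery: scan every feature on the small discovery sample.
    def discovery_strength(feature):
        return abs(pearson(feature[:n_discovery], target[:n_discovery]))

    best = max(features, key=discovery_strength)
    r_discovery = pearson(best[:n_discovery], target[:n_discovery])

    # Verification: re-measure the chosen feature on unseen rows.
    r_verify = pearson(best[n_discovery:], target[n_discovery:])
    return r_discovery, r_verify


r_discovery, r_verify = discover_then_verify()
print(f"discovery r = {r_discovery:+.2f}, verification r = {r_verify:+.2f}")
```

The discovery step answers "what" quickly; only the verification step says whether the answer means anything.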
20. Second Law of Data Science*
Causality can be determined from correlations only by community-accepted mechanisms and metrics**, e.g., empiricism.
* With Gregory Piatetsky-Shapiro, KDNuggets
** for What and Why
21. Limits of Data Science
We do not know where our concepts, tools, and techniques break on massive data sets!
Caution: Big Data Winter Potential (Michael Jordan)
Best Practice #2: Experiment + Error bars everywhere
— Common Practice: not so much
Best Practice #3: Machine-driven, human guided
— Common Practice: not so much
22. Best Practice Not So Common*
• BP1: Efficacy-driven data discovery
— Best eScience, Journalism, Economics, Computational X, …
— Big Data not so much (<5%)
• BP2: Experiment + Error bars everywhere
— Above + Best Data Scientists (~5%, w/ scientific, ML, … training)
— Big Data (<5%): Customers don’t ask; data scientists don’t practice
• BP3: Machine-driven, human guided
— ~5% strict; 95% not so much, e.g., ~60 Data Curation products
— 50% partial: supervised / trained
• Example: based on the above Laws and Best Practices
*Personal un-scientific study, limited data, yet so unbiased and oh so true
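One inexpensive way to practice BP2 is to attach a bootstrap interval to every reported statistic. The following is a minimal sketch, assuming a percentile bootstrap is acceptable for the statistic at hand; the helper name `bootstrap_ci`, the synthetic data, and the defaults are hypothetical choices for illustration.

```python
import random
import statistics


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(pairs, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(pairs)."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = sorted(
        stat([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic data (an assumption for illustration): y depends on x plus noise.
rng = random.Random(42)
pairs = []
for _ in range(200):
    x = rng.gauss(0, 1)
    pairs.append((x, x + rng.gauss(0, 0.75)))

r = pearson(pairs)
lo, hi = bootstrap_ci(pairs, pearson)
print(f"r = {r:.2f}, 95% bootstrap interval [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate makes it visible when a result is too uncertain to act on.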
23. Laws of Data Science Less So …
1st Correlations ≠ Causation
Common confusion in science*, more in Data Science, even more in business
2nd Causality (meaning) requires verification by community-accepted norms
Cornerstone of Science, hopefully emerging in Data Science**
*Richard Feynman, 1974
** If #1 is rare, #2 is more so
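The first point can be made concrete with a hidden common cause. In this sketch, which assumes synthetic data and hypothetical names (`simulate`, `residuals`), a variable `z` drives both `x` and `y`. The two are strongly correlated even though neither affects the other, and the correlation disappears once the linear effect of `z` is removed.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, zs):
    """Remove the linear effect of zs from ys (simple regression residuals)."""
    mean_z, mean_y = statistics.fmean(zs), statistics.fmean(ys)
    slope = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / sum(
        (z - mean_z) ** 2 for z in zs
    )
    return [y - mean_y - slope * (z - mean_z) for z, y in zip(zs, ys)]


def simulate(n=2000, seed=0):
    """x and y both depend on z; x never influences y (by construction)."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 0.5) for zi in z]
    y = [zi + rng.gauss(0, 0.5) for zi in z]
    return x, y, z


x, y, z = simulate()
raw = pearson(x, y)
partial = pearson(residuals(x, z), residuals(y, z))
print(f"raw correlation: {raw:.2f}, after controlling for z: {partial:.2f}")
```

Adjusting for `z` is possible here only because the confounder was measured. In practice it often is not, which is why the second law asks for verification by community-accepted norms instead of relying on the correlation alone.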
24. Conclusions
• Big Data is in its infancy and is opening the door to …
• Grand Opportunities
• Grand Challenges
• 10+ year evolution
• Data Science ~= Scientific Method For Data
• Laws of Data Science
1 Correlations must be verified
2 Verification relative to community-accepted norms
• Data Science Best Practices
1 Efficacy-driven discovery
2 Experiment + Error Bars everywhere
3 Machine-Driven – Human Guided
• Limit of Data Science: we do not know where our tools break