SlideShare a Scribd company logo
International School of Physics
“Enrico Fermi”
Workshop 205
Big Data Analytics
27-29 June 2019
Villa Monastero, Varenna
Big Data and Data Science
Fabio Stella
University of Milan-Bicocca
Department of Informatics, Systems and Communication
The 100,000
Genomes Project
▪ complete sequencing of
70,000 individuals
▪ 21 PetaBytes of data
▪ 1 PetaByte of music requires
about 2,000 years to be
listened
The complexity increases with
digital data
▪ EHR
▪ Life style
▪ Nutrition
▪ Job type
▪ Geographical
▪ Social networks
Smart Cities
▪ Automate
▪ Control
▪ See real-time data
▪ Home automation
▪ Environment monitoring
▪ Medical and healtcare
▪ Smart transportation
▪ Smart manufactuing
▪ Energy resource
management
Large Hadron Collider
▪ 2017 June 29 at CERN,
200 PB of data permanently
stored.
▪ 1 billion collisions per
second happening at,
ATLAS, CMS, ALICE, and
LHCb.
According to [1],
• Big Data and Data Science are being used as buzzwords and are composites of
many concepts.
• Big data appears frequently in the press and academic journals.
• In the last five years; birth and growth of many data science programs in academia.
• In 2012, the White House Office of Science and Technology Policy announced the
Big Data Research and Development Initiative that builds upon federal initiatives
ranging from
✓ computer architecture and networking technologies,
✓ algorithms,
✓ data management,
✓ artificial intelligence,
✓ machine learning,
✓ advanced cyber-infrastructure.
[1] Brady H. E., Annual Review of Political Science, 22 (2019) 297.
Big Data 5 V’s
Volume; the amount of data
to be stored and managed.
▪ It is probably the first thing
people think of when they
hear the phrase big data.
▪ We typically talk about big
data in the case where
ExaBytes or PetaBytes
have to be stored and
managed.
Big Data 5 V’s
Velocity; defines the speed
of increase in big data volume
and its relative accessibility.
▪ This dimension helps
organizations to realize the
relative growth of their big
data and how quickly that
data reaches sourcing
users, applications and
systems.
Big Data 5 V’s
Variety; we no longer have
control over the data format,
i.e. we must manage
heterogeneous data, with
different formats, semantics
and structures.
▪ In other words we could be
asked to combine;
✓ pure text,
✓ photo,
✓ audio,
✓ video,
✓ web,
✓ GPS data,
✓ sensor data,
✓ relational data bases,
✓ and documents
to answer a given question.
Big Data 5 V’s
Veracity; how accurate or
truthful a data set may be.
▪ In the context of big data,
however, it takes on a bit
more meaning.
▪ When it comes to the
accuracy of big data, it’s not
just the quality of the data
itself but how trustworthy
the data source, type, and
processing of it is.
▪ Removing things like bias,
abnormalities or
inconsistencies, duplication,
and volatility are just a few
aspects that factor into
improving the accuracy of
big data.
Big Data 5 V’s
Value; the worth of the data
being extracted.
▪ Having endless amounts of
data is one thing, but unless
it can be turned into value it
is useless.
▪ While there is a clear link
between data and insights,
this does not always mean
there is value in Big Data.
▪ The most important part of
embarking on a big data
initiative is to understand
the costs and benefits of
collecting and analyzing the
data to ensure that
ultimately the data that is
reaped can be monetized.
Big Data 5 V’s
Artificial Intelligence;
intelligence demonstrated by
machines, in contrast to the
natural intelligence displayed
by humans.
▪ Colloquially, the term
Artificial Intelligence (AI) is
used to describe
machines/computers that
mimic “cognitive” functions
that humans associate
with other human minds,
such as "learning" and
"problem solving".
▪ Two kinds of AI:
✓ Weak
✓ Strong
Artificial Intelligence
Bayesian Networks
▪ A type of statistical model
that represents a set of
variables and their
conditional dependencies
via a directed acyclic
graph (DAG).
▪ Bayesian networks are
ideal for taking an event
that occurred and
predicting the likelihood
that any one of several
possible known causes
was the contributing
factor.
▪ Structural Causal Models.
Artificial Intelligence
Knowledge Graphs
▪ A knowledge base used
by Google and its
services to enhance its
search engine's results
with information gathered
from a variety of sources.
▪ The information is
presented to users in an
infobox.
Machine Learning;
algorithms and statistical
models that computer
systems use in order to
perform a specific task
effectively without using
explicit instructions, relying
on patterns and inference
instead.
Three kinds of ML;
▪ Supervised
▪ Self-Supervised
▪ Reinforcement Learning
Machine Learning
Supervised
▪ Classification
SETOSA
VERSICOLOR
VIRGINICA
Machine Learning
Supervised
▪ Curve fitting
▪ Surface fitting
Machine Learning
Self-Supervised
▪ Recommendation
System
▪ Market Basket Analysis
▪ Social Network Analysis
Machine Learning
Reinforcement
Learning
▪ Learn by interacting with the
environment
▪ The environment reacts to
our decisions/actions
▪ Sequential learning, only at
the end of the game we
know our performance
(reward/punishment)
Deep Learning; is part
of a broader family of
machine learning methods
based on Artificial Neural
Networks.
Three kinds of DL;
▪ Supervised
▪ Self-Supervised
▪ Reinforcement Learning
Deep Learning
Feedforward Neural
Networks
▪ The first and simplest type
of artificial neural network
devised.
▪ The information moves in
only one direction,
forward, from the input
nodes, through the hidden
nodes (if any) and to the
output nodes.
▪ There are no cycles or
loops in the network.
8 pixels
8pixels
X1
X64
Y0
Y9
…
…
…
…
Deep Learning
Convolutional
Networks
▪ Regularized versions of
multilayer perceptrons
which are fully connected
and thus prone to
overfitting the data.
▪ Regularization by adding
some form of magnitude
measurement of weights
to the loss function.
▪ Different approach
towards regularization:
take advantage of the
hierarchical pattern in data
and assemble more
complex patterns using
smaller and simpler
patterns.
Deep Learning
LSTM Networks
▪ An artificial Recurrent
Neural Network
architecture.
▪ Unlike standard
feedforward neural
networks, LSTM has
feedback connections that
make it a "general
purpose computer" (it can
compute anything that a
Turing machine can).
▪ LSTM started to
revolutionize speech
recognition, outperforming
traditional models in
certain speech
applications.
“The vision systems of the eagle and the snake outperform
everything that we can make in the laboratory, but snakes
and eagles cannot build and eyeglass or a telescope or a
microscope."
— Judea Pearl
Fitting does not give us any understanding
Data Science
▪ Applied Mathematics
▪ Computer Science
▪ Neuroscience
▪ Engineering
▪ Statistics
▪ Biology
▪ Physics
Data science is the application of computational and statistical
techniques to address or gain insight into some problem in the
real world (J. Zico Kolter, Carnegie Mellon University, 2018)
Data science is an extraordinary multidisciplinary and
interdisciplinary challenge to apply the scientific method
empowered by
• data,
• models,
• computational resources,
• open source software, and
… the most sophisticated technology we have today, i.e. the
human being.
Causal Inference and the
data-fusion problem
Fabio Stella
University of Milan-Bicocca
Department of Informatics, Systems and Communication
International School of Physics
“Enrico Fermi”
Workshop 205
Big Data Analytics
27-29 June 2019
Villa Monastero, Varenna
What can truly
be achieved?
▪ No matter how much
data you collect, the
question can not be
answered when using
the data alone
Does Exercise affects
Cholesterol?
Simpson’s Paradox
Age
Cholesterol Exercise
Big Data and the
data-fusion problem
▪ Piecing together multiple
datasets collected under
heterogeneous conditions
(i.e., different populations,
regimes, and sampling
methods) to obtain valid
answers to queries of
interest.
▪ The availability of multiple
heterogeneous datasets
presents new
opportunities to big data
analysts, because the
knowledge that can be
acquired from combined
data would not be
possible from any
individual source alone.
The Ladder
of Causation
Seeing; we are looking for
regularities in observations.
“What if I see …?”
Calls for predictions based on passive observations.
It is characterized by the question “What if I see …?”
For instance, imagine a marketing director at a
department store who asks,
“How likely is a customer who bought toothpaste to
also buy dental floss?”
“What if do …?” & “How?”
We step up to the next level of causal queries
when we begin to change the world. A typical
question for this level is
“What will happen to our floss sales if we double
the price of toothpaste?”
This already calls for a new
kind of knowledge, absent
from the data, which we find at
rung two of the Ladder of
Causation, Intervention.
Intervention; ranks
higher than association
because it involves not just
seeing but changing what is.
Many scientists have been quite traumatized to learn
that none of the methods they learned in statistics is
sufficient even to articulate, let alone answer, a simple
question like
“What happens if we double the price?”
“What if I had done …?” & “Why?”
Counterfactuals; ranks
higher than intervention
because it involves
imagining, retrospection
and understanding.
We might wonder, My
headache is gone now, but
• Why?
• Was it the aspirin I took?
• The food I ate?
• The good news I heard?
These queries take us to the top rung of the Ladder of
Causation, the level of Counterfactuals, because to
answer them we must go back in time, change history,
and ask,
“What would have happened if I had not taken the
aspirin?”
No experiment in the world can deny treatment to an
already treated person and compare the two outcomes,
so we must import a whole new kind of knowledge.
Z1 is a direct cause of W1, Z3
Z1 is an indirect cause of X, Y, W3
Cause
A variable X is a cause of a
variable Y if Y “listens” to X
and decides its value in
response to what it hears.
▪ Direct cause
▪ Indirect cause
Causal Graph
Exogenous variables;
background conditions for
which no explanatory
mechanism is encoded in
model M.
Every instantiation U=u of the
exogenous variables uniquely
determines the values of all
variables in V and, hence, if
we assign a probability P(u)
to U, it induces a probability
function P(v) on V.
The vector U=u can also be
interpreted as an
experimental “unit” that can
stand for an individual
subject, time of day, …
Exogenous Variables
𝑈 = 𝑍1, 𝑍2
Endogenous Variables
𝑉 = 𝑍3, 𝑋, 𝑊1, 𝑊2, 𝑊3, 𝑌
Causal Graph
𝑥 = 𝑓1 𝑤1, 𝑧3
Structural
Causal
Model
𝑈 = 𝑍1, 𝑍2 𝑉 = 𝑍3, 𝑋, 𝑊1, 𝑊2, 𝑊3, 𝑌
𝐹 = 𝑓1, 𝑓2, …
A structural model M consists of two sets of variables U and V, and a set
F of functions that determine or simulate how values are assigned to each
variable Vi ∈V.
Thus, for example, the equation
𝑣𝑖 = 𝑓𝑖 𝑢, 𝑣
describes a physical process by which variable Vi is assigned the value
𝑣𝑖 = 𝑓𝑖 𝑢, 𝑣 in response to the current values, 𝑣 and 𝑢, of the variables in
V and U.
Formally, the triplet <U,V,F> defines a Structural Causal Model, and the
diagram that captures the relationships among the variables is called the
Causal Graph G (of M).
Cause
A variable X is a cause of a
variable Y if Y “listens” to X
and decides its value in
response to what it hears.
▪ Direct cause
▪ Indirect cause
Conditional
Independence
Basic connections
▪ Chain
▪ Diverging
▪ Converging or Collider
d-separation
𝑃 𝑋|𝑌, 𝑍 = 𝑃 𝑋|𝑍
𝑃 𝑋|𝑌, 𝑍  𝑃 𝑋|𝑍
Gender
Discrimination?
You want to know whether
and to what degree a
company discriminates by
Gender in its Hiring
practices.
Such discrimination would
constitute a direct effect of
Gender on Hiring, which is
illegal in many cases.
However, women are more or
less likely to go into a
particular field than men, or to
have achieved advanced
degrees in that field.
So Gender may also have an
indirect effect on Hiring
through the mediating
variable of Qualification.
Qualification is a mediating variable
Gender
Discrimination?
In order to find the direct
effect of Gender on Hiring,
we need to somehow hold
Qualification steady, and
measure the remaining
relationship between Gender
and Hiring;
with Qualification
unchanging, any change in
Hiring would have to be due
to Gender alone.
𝑃 ℎ𝑖𝑟𝑒𝑑|𝑓𝑒𝑚𝑎𝑙𝑒, ℎ𝑞 < 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑚𝑎𝑙𝑒, ℎ𝑞
highly qualified (hq)
The Confounder
What happens if there are
confounders of the
mediating variable and the
outcome variable.
For instance, Income:
People from higher income
backgrounds are more likely
to have gone to college and
more likely to have
connections that would help
them get hired.
The Confounder
Now, if we condition on
Qualification, we are
conditioning on a Collider.
The Confounder
Now, if we condition on
Qualification, we are
conditioning on a Collider.
Gender and Income
communicate and thus it is
no longer possible to judge
the effect of Gender on
Hiring alone.
Income is a confounder for Qualification
The Confounder
However, if we don’t
condition on Qualification,
indirect dependence can
pass from Gender to Hiring
through the path to the right.
No matter how you look at it,
we’re not getting the true
direct effect of Gender on
Hiring.
The Confounder
No matter how you look at it,
we’re not getting the true
direct effect of Gender on
Hiring.
Traditionally, therefore,
statistics has had to abandon
a huge class of potential
mediation problems, where
the concept of “direct effect”
could not be defined, let
alone estimated.
𝑃 ℎ𝑖𝑟𝑒𝑑|𝑑𝑜(𝑓𝑒𝑚𝑎𝑙𝑒), 𝑑𝑜(ℎ𝑞) < 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑑𝑜(𝑚𝑎𝑙𝑒), 𝑑𝑜(ℎ𝑞)The “do” operator
Luckily, we now have a
conceptual way of holding
the mediating variable steady
without conditioning on it:
We can intervene on it.
If, instead of conditioning, we
fix the qualifications, the
arrow between gender and
qualifications (and the one
between income and
qualifications) disappears,
and no spurious dependence
can pass through it.
The “do” operator
Luckily, we now have a
conceptual way of holding
the mediating variable steady
without conditioning on it:
We can intervene on it.
If, instead of conditioning, we
fix the qualifications, the
arrow between gender and
qualifications (and the one
between income and
qualifications) disappears,
and no spurious dependence
can pass through it.
Conclusions
▪ Big Data offers an extraordinary opportunity to understand how things
work, this holds almost in any discipline and application domain.
▪ Causality will play a central role in Big Data, one of the most
challenging problems is data-fusion (How to combine heterogeneous
data collected from different sources?)
▪ Data Science can play a major role in the incoming revolution, “The
revolution hasn't happened yet.” (Michael Jordan).
▪ More interdisciplinary and multidisciplinary efforts are needed.
▪ Need to think about a possible alternative model for education, both
low and high level.
International School of Physics
“Enrico Fermi”
Workshop 205
Big Data Analytics
27-29 June 2019
Villa Monastero, Varenna
References
▪ [1] Brady H. E., Annual Review of Political Science, 22 (2019) 297.

More Related Content

What's hot

Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
StampedeCon
 
Big data, big opportunities
Big data, big opportunitiesBig data, big opportunities
Big data, big opportunities
Chouaieb NEMRI
 
Big Data and the Quantified Self
Big Data and the Quantified SelfBig Data and the Quantified Self
Big Data and the Quantified Self
Melanie Swan
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
Physical Cyber Social Computing
Physical Cyber Social ComputingPhysical Cyber Social Computing
Physical Cyber Social Computing
Amit Sheth
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
Johann van Wyk
 
Big data march2016 ipsos mori
Big data march2016 ipsos moriBig data march2016 ipsos mori
Big data march2016 ipsos mori
Chris Guthrie
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data Context
Murad Daryousse
 
Philosophy of Big Data
Philosophy of Big DataPhilosophy of Big Data
Philosophy of Big Data
Melanie Swan
 
BIG DATA | How to explain it & how to use it for your career?
BIG DATA | How to explain it & how to use it for your career?BIG DATA | How to explain it & how to use it for your career?
BIG DATA | How to explain it & how to use it for your career?
Tuan Yang
 
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Artificial Intelligence Institute at UofSC
 
Less is More: Behind the Data at Risk I/O
Less is More: Behind the Data at Risk I/OLess is More: Behind the Data at Risk I/O
Less is More: Behind the Data at Risk I/O
Michael Roytman
 
Ibm watson - how it works, and what it means for society beyond winning jeo...
Ibm   watson - how it works, and what it means for society beyond winning jeo...Ibm   watson - how it works, and what it means for society beyond winning jeo...
Ibm watson - how it works, and what it means for society beyond winning jeo...
Rick Bouter
 
Big Data Ethics
Big Data EthicsBig Data Ethics
Big Data Ethics
Nael Radwan
 
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
AnthonyOtuonye
 
Smart IoT for Connected Manufacturing
Smart IoT for Connected ManufacturingSmart IoT for Connected Manufacturing
Smart IoT for Connected Manufacturing
Amit Sheth
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Amit Sheth
 
Quantified Self Ideology: Personal Data becomes Big Data
Quantified Self Ideology:  Personal Data becomes Big DataQuantified Self Ideology:  Personal Data becomes Big Data
Quantified Self Ideology: Personal Data becomes Big Data
Melanie Swan
 
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
Amit Sheth
 
What's up at Kno.e.sis?
What's up at Kno.e.sis? What's up at Kno.e.sis?
What's up at Kno.e.sis?
Amit Sheth
 

What's hot (20)

Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
 
Big data, big opportunities
Big data, big opportunitiesBig data, big opportunities
Big data, big opportunities
 
Big Data and the Quantified Self
Big Data and the Quantified SelfBig Data and the Quantified Self
Big Data and the Quantified Self
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
 
Physical Cyber Social Computing
Physical Cyber Social ComputingPhysical Cyber Social Computing
Physical Cyber Social Computing
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
 
Big data march2016 ipsos mori
Big data march2016 ipsos moriBig data march2016 ipsos mori
Big data march2016 ipsos mori
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data Context
 
Philosophy of Big Data
Philosophy of Big DataPhilosophy of Big Data
Philosophy of Big Data
 
BIG DATA | How to explain it & how to use it for your career?
BIG DATA | How to explain it & how to use it for your career?BIG DATA | How to explain it & how to use it for your career?
BIG DATA | How to explain it & how to use it for your career?
 
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
Data Processing and Semantics for Advanced Internet of Things (IoT) Applicati...
 
Less is More: Behind the Data at Risk I/O
Less is More: Behind the Data at Risk I/OLess is More: Behind the Data at Risk I/O
Less is More: Behind the Data at Risk I/O
 
Ibm watson - how it works, and what it means for society beyond winning jeo...
Ibm   watson - how it works, and what it means for society beyond winning jeo...Ibm   watson - how it works, and what it means for society beyond winning jeo...
Ibm watson - how it works, and what it means for society beyond winning jeo...
 
Big Data Ethics
Big Data EthicsBig Data Ethics
Big Data Ethics
 
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
 
Smart IoT for Connected Manufacturing
Smart IoT for Connected ManufacturingSmart IoT for Connected Manufacturing
Smart IoT for Connected Manufacturing
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
 
Quantified Self Ideology: Personal Data becomes Big Data
Quantified Self Ideology:  Personal Data becomes Big DataQuantified Self Ideology:  Personal Data becomes Big Data
Quantified Self Ideology: Personal Data becomes Big Data
 
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
TRANSFORMING BIG DATA INTO SMART DATA: Deriving Value via Harnessing Volume, ...
 
What's up at Kno.e.sis?
What's up at Kno.e.sis? What's up at Kno.e.sis?
What's up at Kno.e.sis?
 

Similar to 2019 June 27 - Big data and data science

Data science
Data scienceData science
Data science
DeekshaSrivas
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1
sasi
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Dr. Sunil Kr. Pandey
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
Muhammad Rumman Islam Nur
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
Kelly Technologies
 
Understand the Idea of Big Data and in Present Scenario
Understand the Idea of Big Data and in Present ScenarioUnderstand the Idea of Big Data and in Present Scenario
Understand the Idea of Big Data and in Present Scenario
AI Publications
 
Big Data World
Big Data WorldBig Data World
Big Data World
Hossein Zahed
 
BIG-DATAPPTFINAL.ppt
BIG-DATAPPTFINAL.pptBIG-DATAPPTFINAL.ppt
BIG-DATAPPTFINAL.ppt
rajsharma159890
 
Smart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart dataSmart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart data
caniceconsulting
 
Is big data just a buzzword -Big data simply explained
Is big data just a buzzword -Big data simply explainedIs big data just a buzzword -Big data simply explained
Is big data just a buzzword -Big data simply explained
Vivek Srivastava
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
Philip Bourne
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data scienceJordan Engbers
 
Putting data science into perspective
Putting data science into perspectivePutting data science into perspective
Putting data science into perspective
Sravan Ankaraju
 
Big data Paper
Big data PaperBig data Paper
Big data Paper
Daryaz Fares
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
Philip Bourne
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
shalini s
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
Hari Priya
 
Notes from the Observation Deck // A Data Revolution
Notes from the Observation Deck // A Data Revolution Notes from the Observation Deck // A Data Revolution
Notes from the Observation Deck // A Data Revolution
gngeorge
 
Better Data for a Better World
Better Data for a Better WorldBetter Data for a Better World
Better Data for a Better World
Rothamsted Research, UK
 
Business Intelligence & Predictive Analytic by Prof. Lili Saghafi
Business Intelligence & Predictive Analytic by Prof. Lili SaghafiBusiness Intelligence & Predictive Analytic by Prof. Lili Saghafi
Business Intelligence & Predictive Analytic by Prof. Lili Saghafi
Professor Lili Saghafi
 

Similar to 2019 June 27 - Big data and data science (20)

Data science
Data scienceData science
Data science
 
Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1Fundamentals of Data science Introduction Unit 1
Fundamentals of Data science Introduction Unit 1
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Data Mining With Big Data
Data Mining With Big DataData Mining With Big Data
Data Mining With Big Data
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Understand the Idea of Big Data and in Present Scenario
Understand the Idea of Big Data and in Present ScenarioUnderstand the Idea of Big Data and in Present Scenario
Understand the Idea of Big Data and in Present Scenario
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
BIG-DATAPPTFINAL.ppt
BIG-DATAPPTFINAL.pptBIG-DATAPPTFINAL.ppt
BIG-DATAPPTFINAL.ppt
 
Smart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart dataSmart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart data
 
Is big data just a buzzword -Big data simply explained
Is big data just a buzzword -Big data simply explainedIs big data just a buzzword -Big data simply explained
Is big data just a buzzword -Big data simply explained
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Making an impact with data science
Making an impact  with data scienceMaking an impact  with data science
Making an impact with data science
 
Putting data science into perspective
Putting data science into perspectivePutting data science into perspective
Putting data science into perspective
 
Big data Paper
Big data PaperBig data Paper
Big data Paper
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Notes from the Observation Deck // A Data Revolution
Notes from the Observation Deck // A Data Revolution Notes from the Observation Deck // A Data Revolution
Notes from the Observation Deck // A Data Revolution
 
Better Data for a Better World
Better Data for a Better WorldBetter Data for a Better World
Better Data for a Better World
 
Business Intelligence & Predictive Analytic by Prof. Lili Saghafi
Business Intelligence & Predictive Analytic by Prof. Lili SaghafiBusiness Intelligence & Predictive Analytic by Prof. Lili Saghafi
Business Intelligence & Predictive Analytic by Prof. Lili Saghafi
 

Recently uploaded

Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
NoelManyise1
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
ronaldlakony0
 

Recently uploaded (20)

Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiologyBLOOD AND BLOOD COMPONENT- introduction to blood physiology
BLOOD AND BLOOD COMPONENT- introduction to blood physiology
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
S.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary levelS.1 chemistry scheme term 2 for ordinary level
S.1 chemistry scheme term 2 for ordinary level
 

2019 June 27 - Big data and data science

  • 1. International School of Physics “Enrico Fermi” Workshop 205 Big Data Analytics 27-29 June 2019 Villa Monastero, Varenna Big Data and Data Science Fabio Stella University of Milan-Bicocca Department of Informatics, Systems and Communication
  • 2. The 100,000 Genomes Project ▪ complete sequencing of 70,000 individuals ▪ 21 PetaBytes of data ▪ 1 PetaByte of music requires about 2,000 years to be listened The complexity increases with digital data ▪ EHR ▪ Life style ▪ Nutrition ▪ Job type ▪ Geographical ▪ Social networks
  • 3. Smart Cities ▪ Automate ▪ Control ▪ See real-time data ▪ Home automation ▪ Environment monitoring ▪ Medical and healtcare ▪ Smart transportation ▪ Smart manufactuing ▪ Energy resource management
  • 4. Large Hadron Collider ▪ 2017 June 29 at CERN, 200 PB of data permanently stored. ▪ 1 billion collisions per second happening at, ATLAS, CMS, ALICE, and LHCb.
  • 5. According to [1], • Big Data and Data Science are being used as buzzwords and are composites of many concepts. • Big data appears frequently in the press and academic journals. • In the last five years; birth and growth of many data science programs in academia. • In 2012, the White House Office of Science and Technology Policy announced the Big Data Research and Development Initiative that builds upon federal initiatives ranging from ✓ computer architecture and networking technologies, ✓ algorithms, ✓ data management, ✓ artificial intelligence, ✓ machine learning, ✓ advanced cyber-infrastructure. [1] Brady H. E., Annual Review of Political Science, 22 (2019) 297.
  • 6. Big Data 5 V’s
  • 7. Volume; the amount of data to be stored and managed. ▪ It is probably the first thing people think of when they hear the phrase big data. ▪ We typically talk about big data in the case where ExaBytes or PetaBytes have to be stored and managed. Big Data 5 V’s
  • 8. Velocity; defines the speed of increase in big data volume and its relative accessibility. ▪ This dimension helps organizations to realize the relative growth of their big data and how quickly that data reaches sourcing users, applications and systems. Big Data 5 V’s
  • 9. Variety; we no longer have control over the data format, i.e. we must manage heterogeneous data, with different formats, semantics and structures. ▪ In other words we could be asked to combine; ✓ pure text, ✓ photo, ✓ audio, ✓ video, ✓ web, ✓ GPS data, ✓ sensor data, ✓ relational data bases, ✓ and documents to answer a given question. Big Data 5 V’s
  • 10. Veracity; how accurate or truthful a data set may be. ▪ In the context of big data, however, it takes on a bit more meaning. ▪ When it comes to the accuracy of big data, it’s not just the quality of the data itself but how trustworthy the data source, type, and processing of it is. ▪ Removing things like bias, abnormalities or inconsistencies, duplication, and volatility are just a few aspects that factor into improving the accuracy of big data. Big Data 5 V’s
  • 11. Value; the worth of the data being extracted. ▪ Having endless amounts of data is one thing, but unless it can be turned into value it is useless. ▪ While there is a clear link between data and insights, this does not always mean there is value in Big Data. ▪ The most important part of embarking on a big data initiative is to understand the costs and benefits of collecting and analyzing the data to ensure that ultimately the data that is reaped can be monetized. Big Data 5 V’s
  • 12. Artificial Intelligence; intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. ▪ Colloquially, the term Artificial Intelligence (AI) is used to describe machines/computers that mimic “cognitive” functions that humans associate with other human minds, such as "learning" and "problem solving". ▪ Two kinds of AI: ✓ Weak ✓ Strong
  • 13. Artificial Intelligence Bayesian Networks ▪ A type of statistical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). ▪ Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. ▪ Structural Causal Models.
  • 14. Artificial Intelligence Knowledge Graphs ▪ A knowledge base used by Google and its services to enhance its search engine's results with information gathered from a variety of sources. ▪ The information is presented to users in an infobox.
  • 15. Machine Learning; algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. Three kinds of ML; ▪ Supervised ▪ Self-Supervised ▪ Reinforcement Learning
  • 17. Machine Learning Supervised ▪ Curve fitting ▪ Surface fitting
  • 18. Machine Learning Self-Supervised ▪ Recommendation System ▪ Market Basket Analysis ▪ Social Network Analysis
  • 19. Machine Learning Reinforcement Learning ▪ Learn by interacting with the environment ▪ The environment reacts to our decisions/actions ▪ Sequential learning, only at the end of the game we know our performance (reward/punishment)
  • 20.
  • 21. Deep Learning; is part of a broader family of machine learning methods based on Artificial Neural Networks. Three kinds of DL; ▪ Supervised ▪ Self-Supervised ▪ Reinforcement Learning
  • 22. Deep Learning Feedforward Neural Networks ▪ The first and simplest type of artificial neural network devised. ▪ The information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. ▪ There are no cycles or loops in the network. 8 pixels 8pixels X1 X64 Y0 Y9 … … … …
  • 23. Deep Learning Convolutional Networks ▪ Regularized versions of multilayer perceptrons which are fully connected and thus prone to overfitting the data. ▪ Regularization by adding some form of magnitude measurement of weights to the loss function. ▪ Different approach towards regularization: take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns.
  • 24. Deep Learning LSTM Networks ▪ An artificial Recurrent Neural Network architecture. ▪ Unlike standard feedforward neural networks, LSTM has feedback connections that make it a "general purpose computer" (it can compute anything that a Turing machine can). ▪ LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications.
  • 25. “The vision systems of the eagle and the snake outperform everything that we can make in the laboratory, but snakes and eagles cannot build and eyeglass or a telescope or a microscope." — Judea Pearl
  • 26. Fitting does not give us any understanding
  • 27. Data Science ▪ Applied Mathematics ▪ Computer Science ▪ Neuroscience ▪ Engineering ▪ Statistics ▪ Biology ▪ Physics Data science is the application of computational and statistical techniques to address or gain insight into some problem in the real world (J. Zico Kolter, Carnegie Mellon University, 2018) Data science is an extraordinary multidisciplinary and interdisciplinary challenge to apply the scientific method empowered by • data, • models, • computational resources, • open source software, and … the most sophisticated technology we have today, i.e. the human being.
  • 28. Causal Inference and the data-fusion problem Fabio Stella University of Milan-Bicocca Department of Informatics, Systems and Communication International School of Physics “Enrico Fermi” Workshop 205 Big Data Analytics 27-29 June 2019 Villa Monastero, Varenna
  • 29. What can truly be achieved? ▪ No matter how much data you collect, the question can not be answered when using the data alone Does Exercise affects Cholesterol? Simpson’s Paradox Age Cholesterol Exercise
  • 30. Big Data and the data-fusion problem ▪ Piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. ▪ The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone.
  • 32. Seeing; we are looking for regularities in observations. “What if I see …?” Calls for predictions based on passive observations. It is characterized by the question “What if I see …?” For instance, imagine a marketing director at a department store who asks, “How likely is a customer who bought toothpaste to also buy dental floss?”
  • 33. “What if do …?” & “How?” We step up to the next level of causal queries when we begin to change the world. A typical question for this level is “What will happen to our floss sales if we double the price of toothpaste?” This already calls for a new kind of knowledge, absent from the data, which we find at rung two of the Ladder of Causation, Intervention. Intervention; ranks higher than association because it involves not just seeing but changing what is. Many scientists have been quite traumatized to learn that none of the methods they learned in statistics is sufficient even to articulate, let alone answer, a simple question like “What happens if we double the price?”
  • 34. “What if I had done …?” & “Why?” Counterfactuals; ranks higher than intervention because it involves imagining, retrospection and understanding. We might wonder, My headache is gone now, but • Why? • Was it the aspirin I took? • The food I ate? • The good news I heard? These queries take us to the top rung of the Ladder of Causation, the level of Counterfactuals, because to answer them we must go back in time, change history, and ask, “What would have happened if I had not taken the aspirin?” No experiment in the world can deny treatment to an already treated person and compare the two outcomes, so we must import a whole new kind of knowledge.
  • 35. Z1 is a direct cause of W1, Z3 Z1 is an indirect cause of X, Y, W3 Cause A variable X is a cause of a variable Y if Y “listens” to X and decides its value in response to what it hears. ▪ Direct cause ▪ Indirect cause
  • 36. Causal Graph Exogenous variables; background conditions for which no explanatory mechanism is encoded in model M. Every instantiation U=u of the exogenous variables uniquely determines the values of all variables in V and, hence, if we assign a probability P(u) to U, it induces a probability function P(v) on V. The vector U=u can also be interpreted as an experimental “unit” that can stand for an individual subject, time of day, … Exogenous Variables 𝑈 = 𝑍1, 𝑍2
  • 37. Endogenous Variables 𝑉 = 𝑍3, 𝑋, 𝑊1, 𝑊2, 𝑊3, 𝑌 Causal Graph
  • 38. 𝑥 = 𝑓1 𝑤1, 𝑧3 Structural Causal Model 𝑈 = 𝑍1, 𝑍2 𝑉 = 𝑍3, 𝑋, 𝑊1, 𝑊2, 𝑊3, 𝑌 𝐹 = 𝑓1, 𝑓2, …
  • 39. A structural model M consists of two sets of variables U and V, and a set F of functions that determine or simulate how values are assigned to each variable Vi ∈V. Thus, for example, the equation 𝑣𝑖 = 𝑓𝑖 𝑢, 𝑣 describes a physical process by which variable Vi is assigned the value 𝑣𝑖 = 𝑓𝑖 𝑢, 𝑣 in response to the current values, 𝑣 and 𝑢, of the variables in V and U. Formally, the triplet <U,V,F> defines a Structural Causal Model, and the diagram that captures the relationships among the variables is called the Causal Graph G (of M). Cause A variable X is a cause of a variable Y if Y “listens” to X and decides its value in response to what it hears. ▪ Direct cause ▪ Indirect cause
  • 40. Conditional Independence Basic connections ▪ Chain ▪ Diverging ▪ Converging or Collider d-separation 𝑃 𝑋|𝑌, 𝑍 = 𝑃 𝑋|𝑍 𝑃 𝑋|𝑌, 𝑍  𝑃 𝑋|𝑍
  • 41. Gender Discrimination? You want to know whether and to what degree a company discriminates by Gender in its Hiring practices. Such discrimination would constitute a direct effect of Gender on Hiring, which is illegal in many cases. However, women are more or less likely to go into a particular field than men, or to have achieved advanced degrees in that field. So Gender may also have an indirect effect on Hiring through the mediating variable of Qualification. Qualification is a mediating variable
  • 42. Gender Discrimination? In order to find the direct effect of Gender on Hiring, we need to somehow hold Qualification steady, and measure the remaining relationship between Gender and Hiring; with Qualification unchanging, any change in Hiring would have to be due to Gender alone. 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑓𝑒𝑚𝑎𝑙𝑒, ℎ𝑞 < 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑚𝑎𝑙𝑒, ℎ𝑞 highly qualified (hq)
  • 43. The Confounder What happens if there are confounders of the mediating variable and the outcome variable. For instance, Income: People from higher income backgrounds are more likely to have gone to college and more likely to have connections that would help them get hired.
  • 44. The Confounder Now, if we condition on Qualification, we are conditioning on a Collider.
  • 45. The Confounder Now, if we condition on Qualification, we are conditioning on a Collider. Gender and Income communicate and thus it is no longer possible to judge the effect of Gender on Hiring alone. Income is a confounder for Qualification
  • 46. The Confounder However, if we don’t condition on Qualification, indirect dependence can pass from Gender to Hiring through the path to the right. No matter how you look at it, we’re not getting the true direct effect of Gender on Hiring.
  • 47. The Confounder No matter how you look at it, we’re not getting the true direct effect of Gender on Hiring. Traditionally, therefore, statistics has had to abandon a huge class of potential mediation problems, where the concept of “direct effect” could not be defined, let alone estimated.
  • 48. 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑑𝑜(𝑓𝑒𝑚𝑎𝑙𝑒), 𝑑𝑜(ℎ𝑞) < 𝑃 ℎ𝑖𝑟𝑒𝑑|𝑑𝑜(𝑚𝑎𝑙𝑒), 𝑑𝑜(ℎ𝑞)The “do” operator Luckily, we now have a conceptual way of holding the mediating variable steady without conditioning on it: We can intervene on it. If, instead of conditioning, we fix the qualifications, the arrow between gender and qualifications (and the one between income and qualifications) disappears, and no spurious dependence can pass through it.
  • 49. The “do” operator Luckily, we now have a conceptual way of holding the mediating variable steady without conditioning on it: We can intervene on it. If, instead of conditioning, we fix the qualifications, the arrow between gender and qualifications (and the one between income and qualifications) disappears, and no spurious dependence can pass through it.
  • 50. Conclusions ▪ Big Data offers an extraordinary opportunity to understand how things work, this holds almost in any discipline and application domain. ▪ Causality will play a central role in Big Data, one of the most challenging problems is data-fusion (How to combine heterogeneous data collected from different sources?) ▪ Data Science can play a major role in the incoming revolution, “The revolution hasn't happened yet.” (Michael Jordan). ▪ More interdisciplinary and multidisciplinary efforts are needed. ▪ Need to think about a possible alternative model for education, both low and high level. International School of Physics “Enrico Fermi” Workshop 205 Big Data Analytics 27-29 June 2019 Villa Monastero, Varenna
  • 51. References ▪ [1] Brady H. E., Annual Review of Political Science, 22 (2019) 297.