Does Data Quality lie in facts, or in acts?
A journey to the country of data use, abuse and misuse
Robert Jeansoulin, emeritus CNRS, Univ. Paris-Est
Workshop question
• Quality assessment of geospatial data: does it fit your needs?
“Needs” are the thread to follow in answering that question.
• Data producer: quality is part of the product (explicit responsibility to respect the producer's own specifications).
internal quality IQ (= product quality): Pristine Quality
• User: quality? Does the user know that it is needed? It depends on the context: a purpose (possibly implicit) + available data (+ associated quality) + formulation of queries + computations (multiple, diluted responsibility).
external quality (= contextual quality): Stained Quality
Early influencers
Henry Prade (Lotfi Zadeh’s PhD)
on: Possibility Theory and “Membership” functions (µ)
PhD, U. Toulouse, 1980
L. Zadeh described himself as
“{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.”
µ values, for instance:
            in some context    in another one
µ(Us)       .9                 .8
µ(m)        .7                 .8
µ(ee)       .8                 .4
µ(Iran)     .6                 .8
µ(Rus)      .4                 .6
Possibility Theory and Membership functions
Early influencers
Henry Prade (Lotfi Zadeh’s PhD)
L. Zadeh described himself as
“{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.”
µ values, for instance:
            in some context    in another context
µ(Us)       .9                 .6
µ(m)        .6                 .8
µ(ee)       .8                 .4
µ(Iran)     .5                 .8
µ(Rus)      .4                 .6
Possibility Theory and Membership functions
Early influencers
first met in Luxembourg, 1990
Pete Fisher: conversations on “Activating quality”, … which we put into practice …
[Photo captions: Henry Prade, Lotfi Zadeh]
Possibility Theory and Membership functions
Early influencers
Henry Prade (Lotfi Zadeh’s PhD): “membership functions”
Toulouse, 1980
Brussels, 1990
Pete Fisher: “activating quality”.
… we did “activate quality” in the REVIGIS project, 1999-2004
at ITC
Recent influences
A journey to the country of data use, abuse and misuse
When a data scientist frequents the circle of politicians, diplomats and lawmakers, it could be a Mark Twain-like story …
… striving to comply with this rule:
“thou shalt say the (whole) truth and nothing but the truth”
Recent influences
A journey to the country of data use, abuse and misuse
… finding eventually that half is enough:
“thou shalt say the (whole) truth and nothing but the truth”
(*) In King Arthur’s court, “whole truth” = as much as necessary, not more
Recent influences
Is “nothing but the truth” enough?
Frequenting King Arthur’s court can help a data scientist think more deeply about needs and goals, facts and acts, quality and conviction, etc., and ask questions such as …
• Quality in the facts versus Quality in the acts?
• Quality in the data versus Fitting the queries?
• Quality in the representation versus Faith in the processing?
Quality of one data
• Consider this table*, published by the NSF:
* Doctorate recipients with temporary visas intending to stay in the USA (excerpt from mandatory forms filled in by all PhD students when defending their thesis)
France, 2007, staying in the USA after PhD: 69.7 %
We are in the era of Big Data, but let’s start modestly, with only one data point
Quality of one data
In 2009 a think tank used that figure, rounded to 70%, as the “key data” of this report (cf. website).
The report got pretty good media coverage.
Where is the quality issue? Not in the accuracy.
What does 70% imply? (rising media interest!)
Is it “a lot”? … without a baseline … “yes”, in general.
But let’s say “about 50% of newborns are boys”: everybody knows it’s not “a lot”, it’s just “usual”.
Without more data, a single figure is useless for a comparison, no matter how accurate it is.
Quality of one data
Investigating the NSF table for year 2007 more deeply:
All “visa” PhDs intending to stay = 74.3 %
Much ado about nothing!
Brain drain doesn’t threaten France!
Report = garbage
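The comparison on this slide can be written down as a small Python sketch. The two percentages are the ones quoted on the slides; the function name `compare_to_baseline` and the wording of its message are only illustrative choices.

```python
# Figures quoted on the slides (NSF table, year 2007).
FRANCE_STAY_RATE = 69.7          # % of French "visa" PhDs intending to stay
ALL_COUNTRIES_STAY_RATE = 74.3   # % of all "visa" PhDs intending to stay


def compare_to_baseline(value: float, baseline: float) -> str:
    """Describe a percentage relative to a baseline percentage."""
    gap = value - baseline
    if gap > 0:
        position = "above"
    elif gap < 0:
        position = "below"
    else:
        position = "equal to"
    return (f"{value:.1f}% is {abs(gap):.1f} points {position} "
            f"the baseline of {baseline:.1f}%")


message = compare_to_baseline(FRANCE_STAY_RATE, ALL_COUNTRIES_STAY_RATE)
print(message)
```

The point is not the arithmetic but the second argument: without a baseline the function cannot even be called.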
Quality of two data (evolution)
• OECD PISA surveys (Programme for International Student Assessment)
– Every 3 years since 2003
– Each survey is accompanied by an OECD report which focuses on so-called key points.
• PISA 2012 survey:
– The focus of the 2012 companion report is to enhance comparison with 2003. (Publication: Dec. 3rd, 2013)
– Impact in France: since September, alert signal from the Ministry of Education = bad results for France!
Quality of two data (evolution)
• PISA 2012, France: score deteriorates in maths … while improving nicely in Germany
Alert! From Ministry
[Chart, 2 dates: scores (axis 450 to 520) in 2003 and 2012 for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
Quality of two data (evolution)
Now, look at the full dataset (intermediate dates 2006-2009)
Alert! From Ministry
• Where is the quality issue?
[Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
Quality of two data (evolution)
When it really happened: 6 years earlier!
Alert! From Ministry
• Where is the quality issue?
alert = garbage
[Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
Quality of two data (evolution)
• Where is the quality issue?
Not in the accuracy,
not in the reliability (OECD data)
……
just in the laziness of the report writers and editors:
all the data were in the tables, in open access.
[Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
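A minimal Python sketch of why two dates hide what four dates show. The scores below are hypothetical placeholders, not values from the OECD tables: they only mimic the shape described on the slides (one drop between the first two surveys, then a plateau). The helper names are illustrative too.

```python
# Hypothetical scores, NOT taken from the OECD tables: they only mimic
# the shape described on the slides (one early drop, then a plateau).
years = [2003, 2006, 2009, 2012]
scores = [510, 496, 497, 495]


def endpoint_change(values):
    """Change seen when only the first and last surveys are compared."""
    return values[-1] - values[0]


def largest_step(times, values):
    """Return (start, end, change) for the biggest change between
    two consecutive surveys."""
    steps = [
        (times[i], times[i + 1], values[i + 1] - values[i])
        for i in range(len(values) - 1)
    ]
    return max(steps, key=lambda step: abs(step[2]))


print("2 dates:", endpoint_change(scores))
start, end, change = largest_step(years, scores)
print(f"4 dates: biggest change {change} between {start} and {end}")
```

With only the endpoints, the whole change is attributed to the 2003-2012 interval; with the intermediate surveys, almost all of it is located in the first three years.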
Intermediate conclusion
• “Nothing but the truth” is not enough!
Between 1 or 2 data and Big Data … it’s worth spending time reading a few more data.
Beware: data carry a lot of implicit context:
– origin, scale,
– adjacent data: time intervals,
– broader neighborhood: part-whole relationship
– … it looks like a “mapping” process!
Learn to read datasets as you learn to read maps, and read as much as possible.
Quality in act is: being aware of the context
A puzzling experience with a .com
Queries:
?: “rue de Miromesnil”
answer: pretty close to my real destination (as I could see later)
?: “metro nearby” (service option)
answer: looks about 300 m from target, seems OK.
Unfortunately … station Havre-Caumartin is not there!
… but about 1 km further east!
Is my “.com” site really dumb?
Investigating further!
• “rue de Miromesnil”: no street # (that’s the user query)
• “métro”: no street # (why?)
• An automatic query was generated to: ratp.fr (official metro website)
what’s going on with ratp.fr…
Same queries on the “ratp.fr” site:
• “rue de Miromesnil” (without number): answers a different location, not in the middle but near the beginning of the street
• “metro” nearby: station Miromesnil (it exists!)
Different protocols: a street without a number is located near #1 by ratp.fr, and near the middle of the street by the .com
Query chain in Big Data context
Answers vary depending on the way the query is built:
1. “.com” directly answers query “rue Miromesnil”
2. “.com” delegates “metro” query to ratp.fr (specialized)
3. “ratp” answers street address without #
4. “.com” assumes street middle (doesn’t check if it complies)
5. Proof: ask “boul. Haussmann” in step 1, gives you a perfect match (… which doesn’t fit needs) ☹
[Diagram: steps 1 to 4 between the .com and ratp]
Separate queries are correctly handled, but:
the assumption of same protocols and ontologies is wrong!
Quality in act in the
Big Data context =
an unmonitored chain of queries
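The mismatch can be illustrated with a toy Python model. Everything here is hypothetical: the street is a straight 1000 m segment, the two stations and their coordinates are invented, and the two geocoding functions only encode the conventions the slides attribute to each site (middle of the street for the “.com”, near #1 for the specialised site). It is a sketch of the mechanism, not of either site's real code.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Point:
    x: float  # metres east of an arbitrary origin (hypothetical grid)
    y: float  # metres north of that origin


def interpolate(start: Point, end: Point, t: float) -> Point:
    """Point at fraction t (0..1) along the segment start -> end."""
    return Point(start.x + t * (end.x - start.x),
                 start.y + t * (end.y - start.y))


def distance(a: Point, b: Point) -> float:
    return hypot(a.x - b.x, a.y - b.y)


# Hypothetical street, modelled as a straight 1000 m segment.
STREET = (Point(0.0, 0.0), Point(0.0, 1000.0))

# Hypothetical stations: one near the start, one near the middle.
STATIONS = {"station_A": Point(50.0, 0.0), "station_B": Point(60.0, 520.0)}


def geocode_street_middle(street):
    """Assumed ".com" convention: no street number -> middle of street."""
    return interpolate(street[0], street[1], 0.5)


def geocode_street_start(street):
    """Assumed specialised-site convention: no street number -> near #1."""
    return interpolate(street[0], street[1], 0.0)


def nearest(stations, where):
    """Name of the station closest to `where`."""
    return min(stations, key=lambda name: distance(stations[name], where))


shown_on_map = geocode_street_middle(STREET)     # what the user sees
used_by_delegate = geocode_street_start(STREET)  # what the delegate uses
expected = nearest(STATIONS, shown_on_map)
returned = nearest(STATIONS, used_by_delegate)
gap = distance(shown_on_map, used_by_delegate)
print(expected, returned, gap)
```

Each function is correct on its own; the error only appears when the answer of one is read against the convention of the other.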
Intermediate conclusion (query)
• Is my query specific enough, including the “implicit”?
• Are the data specific enough for that query, and do they fit the “implicit”?
• Data with the same specificity may not fit questions whose specificity differs.
Examples
• (1) Paris metro stations:
– “rue”: broad (line): OK
– “metro”: implicit point: not OK, because implicit assumptions differ
• (2) Students brain drain:
– “how many staying?”: OK
– implicit: “is it a lot?”: not OK, because no baseline
Quality in act is being (adequately) specific
remember: “nothing but the truth” is not enough
Data, Processing and Quality
• Let’s consider the simplest and most popular processing ever: computing an average value with the arithmetic mean.
Media are fond of the mean.
To be honest, they know that the median may sometimes be better (but it is rarely published).
What does the (mean) mean mean?
The general feeling is to link the mean and the middle (50-50).
• Let’s consider a situation modeled by a relation (e.g. pupils and classrooms): it can be inspected from two points of view. Is the mean the same from both points of view?
• That’s what we expect (general feeling: the mean is independent of the viewpoint).
What does the (mean) mean mean?
• Ex. pupils in schools:
– how many pupils per classroom? (average = mean or median)
– both mean and median are accepted as meaningful (at least for somewhat “regular” data)
Two points of view: from the classroom (lawmaker), and from the pupil (parents):
Pc = Proba(classroom c, classSize(c) ≥ average), expected = 0.50
Pp = Proba(pupil p ∈ c, classSize(c) ≥ average), expected = 0.50
Question 1: does Pc = Pp?
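Question 1 can be explored with a short Python sketch. The breakdown of classes by size below is hypothetical (it is not the UK data discussed later), and the function names are illustrative. Pc counts classes, Pp counts pupils, so large classes weigh more in Pp.

```python
from statistics import mean

# Hypothetical breakdown: class size -> number of classes of that size.
# Illustrative only; not the UK figures discussed in the talk.
CLASSES_BY_SIZE = {20: 30, 25: 30, 30: 40}


def class_view_share(breakdown, threshold):
    """Pc: share of classes whose size is >= threshold."""
    total_classes = sum(breakdown.values())
    big = sum(n for size, n in breakdown.items() if size >= threshold)
    return big / total_classes


def pupil_view_share(breakdown, threshold):
    """Pp: share of pupils sitting in a class whose size is >= threshold."""
    total_pupils = sum(size * n for size, n in breakdown.items())
    big = sum(size * n for size, n in breakdown.items() if size >= threshold)
    return big / total_pupils


# One entry per class, so that the mean is the usual "pupils per class".
sizes = [size for size, n in CLASSES_BY_SIZE.items() for _ in range(n)]
average = mean(sizes)
pc = class_view_share(CLASSES_BY_SIZE, average)
pp = pupil_view_share(CLASSES_BY_SIZE, average)
print(f"mean={average}, Pc={pc:.2f}, Pp={pp:.2f}")
```

In this toy breakdown the two probabilities differ, and neither equals the expected 0.50.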
Easy stats #1: mean
• The UK government publishes the annual arithmetic mean of the number of pupils per class:
• 2010 average class size: mean = 27.8, stable from the previous year
• Number of pupils (primary): 3 850 000
The breakdown of classes by number of pupils is seldom surveyed.
But the UK introduced a new law in 2001.
Easy stats #1: mean
• Evolution of class size: Pc
Introduction of the limit of 30 per class in 2001 → special report providing Pp data!
Easy stats #1: mean
• Looking closer at the report
remember: mean = 27.8
rapid reading of the graph: 65% in [28, +∞[ (41+13+8+2+a few more)
hence: Pp = 0.65, not 0.50
Question 2: is the median better?
Easy stats #1: mean
• Answering questions 1 & 2 with my students … a JavaScript animation:
allow 10% of all classes to send pupils to other classes, under 2 constraints:
sizeMax ≤ 30 and sizeMin ≥ 10
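The animation itself is not part of the transcript. As a rough stand-in, here is a Python sketch (not the author's JavaScript) of one way such a redistribution could be simulated. The starting point of 100 classes of 27 pupils, the choice of the first 10% of classes as senders, and the "fill the first class with room" rule are all assumptions made for illustration; only the two constraints come from the slide.

```python
from statistics import mean, median

SIZE_MAX = 30  # constraint from the slide
SIZE_MIN = 10  # constraint from the slide


def redistribute(sizes, sender_share=0.10):
    """Let the first `sender_share` of classes shrink towards SIZE_MIN,
    moving each pupil to the first other class still below SIZE_MAX."""
    sizes = list(sizes)
    n_senders = int(len(sizes) * sender_share)
    receivers = range(n_senders, len(sizes))
    for sender in range(n_senders):
        for receiver in receivers:
            while sizes[sender] > SIZE_MIN and sizes[receiver] < SIZE_MAX:
                sizes[sender] -= 1
                sizes[receiver] += 1
    return sizes


def share_at_or_above(sizes, threshold, weight_by_pupils):
    """Pc (weight_by_pupils=False) or Pp (weight_by_pupils=True)."""
    weights = sizes if weight_by_pupils else [1] * len(sizes)
    hit = sum(w for s, w in zip(sizes, weights) if s >= threshold)
    return hit / sum(weights)


# Hypothetical start: 100 classes of 27 pupils each.
before = [27] * 100
after = redistribute(before)
for label, centre in (("mean", mean(after)), ("median", median(after))):
    pc = share_at_or_above(after, centre, weight_by_pupils=False)
    pp = share_at_or_above(after, centre, weight_by_pupils=True)
    print(f"{label}={centre}: Pc={pc:.2f}, Pp={pp:.2f}")
```

The total number of pupils is conserved, so the mean does not move, while the median and both probabilities do. Under these assumptions, neither the mean nor the median gives Pc = Pp.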
Intermediate conclusion (averages)
(1) Relations (e.g. membership, part_of) are two-sided: different actors may have different visions of the same relation (membership feeling).
(2) Go beyond single summary values (mean, median): question the non-regular form of the probability density function (pdf) and the existence of possible extreme values (outliers).
Quality in act!
Whenever possible: change the viewpoint, take a step back
Less easy stats (extremes)
• The issue of measuring extreme inequalities has been around for several decades*.
(*) e.g. Centre d’Analyse et Mathématiques Sociales (CAMS, EHESS, Paris)
One more influencer
Marc Barbut, CAMS cofounder, introduced Pareto among French sociologists, and was a leading voice in the measurement of inequalities.
Less easy stats (extremes)
• Change of viewpoint, second example: the relation population-income
inter-quantile intervals (equi-population) ↔ equidistant intervals of income (equi-revenue).
• The most popular index for measuring inequalities is the Gini coefficient.
• It can be computed equally for both viewpoints, population or income, but the visualization is quite different.
Less easy stats (Gini)
• Gini coefficient with 1 data:
If the richest u % of the population (red) equally share f % of all income,
G = f − u.
• Gini coefficient with n data:
G = \frac{2 \sum_{i=1}^{n} i \, x_i}{n \sum_{i=1}^{n} x_i} - \frac{n+1}{n}
where x_i is the income shared by the i-th fraction of the population (with fractions of equi-population, sorted by increasing income).
Gini is so popular for measuring income inequalities that you can find “the Gini” of any country (without mention of income data, cf. Wikipedia).
Gini can be applied to school/pupils as well, or any kind of resources.
Less easy stats (Gini)
• Where is the quality issue?
What is not mentioned with the Gini:
1. its sensitivity to the scale of the fractions.
2. which fractions of the population are used: deciles? percentiles? (I couldn't find it in OECD docs).
A simple demonstration: split a total income of 100
with the median alone: gini(5, 95) = 0.45
with the 3 quartiles: gini(2, 3, 3, 92) = 0.67
and deciles: gini(1, 1, 1, 1, 1, 1, 1, 1, 92) = 0.81
– http://www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
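The simple demonstration can be reproduced with a few lines of Python. This sketch assumes the common rank-weighted form of the coefficient for equal-population fractions, G = 2 Σ i·x_i / (n Σ x_i) − (n+1)/n with the x_i sorted in increasing order; the slides do not say which formula the linked calculator uses. The three splits are copied as listed on the slide (the last one has nine entries there).

```python
def gini(shares):
    """Gini coefficient for equal-population fractions.

    `shares` holds the income received by each fraction; values are
    sorted in increasing order before the rank-weighted sum is taken.
    """
    xs = sorted(shares)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Splits of a total income of 100, copied as listed on the slide.
halves = [5, 95]
quarters = [2, 3, 3, 92]
finer = [1, 1, 1, 1, 1, 1, 1, 1, 92]

for label, split in (("2 parts", halves),
                     ("4 parts", quarters),
                     ("finer", finer)):
    print(label, round(gini(split), 4))
```

The same concentration of income at the top yields a higher coefficient simply because the population is cut into more fractions, which is the sensitivity to scale named in point 1.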
Less easy stats (Gini)
• and a real demonstration (French incomes 2007, after T. Piketty)
equi-population quartiles, Gini = 0.4271
equi-population deciles, Gini = 0.4683
equi-population percentiles, Gini = 0.4812
[Chart labels: 99% | 1%]
Less easy stats (Gini)
• and now with equi-revenue brackets (GMI = guaranteed minimal income)
the 99% paradigm: starts at 5 x GMI …
at 10 x GMI, Gini = 0.6072 … and counting …
[Chart labels: 99% | 1%; line at 10 x GMI]
Less easy stats (Gini)
• and beyond
the 99.9% paradigm: starts at 15 x GMI …
and at 100 x GMI, Gini = 0.9406 … and counting …
[Chart labels: 99.9% | 0.1%; lines at 10 x GMI and 100 x GMI]
Less easy stats (Gini)
• Quality issues: resampling.
The most difficult part (controversial?) is to compute equal-size income brackets (equi-revenue) from percentiles of population:
Aggregation: many-to-one
Disaggregation: one-to-many
NOIR (Nominal-Ordinal-Interval-Ratio) arithmetic rules differ depending on:
- whether population is represented as equi-population (I), or not (O)
- whether income is represented as amount (O), equi-revenue (I), or % share (R)
Less easy stats (Gini)
NOIR levels of measurement, arithmetic rules:
– Nominal: simply naming variables. Ex. gender (Male, Female)
– Ordinal: Nominal + rank. Ex. letter grades (A, B, C, D, F)
Order relation is possible: min, max.
– Interval: Ordinal + distance (equal, log-equal).
We can compute sums, differences, means.
– Ratio: Interval + zero. We can multiply/divide.
Refs:
• Stanley Smith Stevens (1946, psychology), revisited by several authors: N. Chrisman (1998, geography), A. Wolman (2006, Measurement in Conservation Science)
Less easy stats (Gini)
• Quality issue with re-sampling: the additional error is probably within the error margin of the primary data (declared income).
• Quality issue with removing outliers, or not: the highest incomes are generally merged with the highest decile, or percentile. In that particular dataset, the highest percentile is dis-aggregated (anonymously: PSE work). And YES, a few hundred incomes are above 1000 times the GMI.
Quality: some acceptable loss in accuracy
Quality: additional brackets are estimated (real data are confidential)
What’s an Outlier?
“an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980)
Some Outliers (www.forbes.com: top 20 fortunes)
Intermediate conclusion (extremes)
• Is it correct to change the viewpoint?
What’s correct is to present both viewpoints.
• Is it correct to transform data through aggregation-disaggregation?
What’s correct is to inform about the quality loss.
• Is it correct not to dismiss “extremes” as “outliers”?
What’s correct is to inform about the quality gain when keeping them.
Quality in act! Extremes are not statistical outliers
Summing up the intermediates
Quality in act is: being aware of the context
Quality in act in the Big Data context: an unmonitored chain of queries
Quality in act is being specific
Quality in act! Change viewpoint, take a step back
Quality in act: extremes are not necessarily statistical outliers
Quality in act: and what about Scale?
Yet another influencer (scale)
• Andrew U. Frank (2009): “Why Is Scale an Effective Descriptor for Data Quality?”
(L. Zadeh, “Some Reflections on Information Granulation and its Centrality in Granular Computing”, is cited by Andrew in that paper.)
My apologies to my other influencers not cited in this presentation; many are collectively cited with the REVIGIS project.
Scale and Quality
• Brain drain example: part-whole relation matters
[Figure labels: China, Italy, Germany, France 70%, all countries 74%]
• PISA example: (Shannon-Nyquist) sampling theorem matters
[Figure labels: 2003, 2006, 2009, 2012]
Scale and Quality
• Query chain example: specificity matters
[Figure label: metro station is a point]
• Extremes example: (Shannon-Nyquist) sampling theorem matters again
[Figure labels: equi-population percentiles: 99% over-sampled, 1% under-sampled; equi-revenue after re-sampling: 99.9%, 0.1%]
Last Conclusion
• Back to the initial question: does it fit your needs?
• Do we really care about Quality?
We do care about Needs, then (in case of doubt) about Quality.
People act according to their needs; if quality fits trust, why go beyond?
In other words:
if the “nothing but the truth” is respected, and needs are satisfied, why ask for the “whole truth”?
(… a scientist's obsession!)
Very Last (personal) Conclusion
• Being wrong is not enough! (to bring corrections)
Because needs were fitted!
• Being right is not enough! (to convince a user in error)
Because needs were fitted!
The quality of the demonstration is the key. Many examples: climate change! archaeology, art market, social indices, …
Quality in act is Ability to Convince
Thank you for the Quality of your attention
Annex: Open questions.
• What metric best leverages the available data?
– Does the user query require the metric to have particular properties (resistance to some factor, comparability across groups, etc.)?
• Big data introduces new facets for quality:
– queries can be sent to several services,
– time lags can differ, and asynchronous processing can proceed in an unexpected sequence,
– not all data are updated at the same time,
– incomplete results can be returned,
– redundancy is everywhere in Nature: it’s a protection against “outliers”. Big Data can be a source of redundancy.
– Open data: remember, you never know what people will do with your data
Yet another influencer
• Darrell Huff, “How to Lie with Statistics” (published in 1954)
Not necessarily intentional lies:
– correlation and implicit causality,
– goal-oriented visualization,
– etc.

Data matters-bournemouth-2015
 
Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...Myths and challenges in knowledge extraction and analysis from human-generate...
Myths and challenges in knowledge extraction and analysis from human-generate...
 
TL_Thompson.pptx.ppt
TL_Thompson.pptx.pptTL_Thompson.pptx.ppt
TL_Thompson.pptx.ppt
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 

Does Data Quality lays in facts, or in acts?

  • 1. Does Data Quality lays in facts, or in acts? A journey to the country of data use, abuse and misuse Robert Jeansoulin, emeritus CNRS, Univ. Paris-Est
  • 2. Workshop question • Quality assessment of geospatial data: does it fit your needs? “Needs” are the thread to follow in answering that question. • Data producer: quality is part of the product (explicit responsibility to respect its own specifications): internal quality, IQ (= product quality), “Pristine Quality”. • User: quality? Does the user know that they need it? It depends on the context: a purpose (possibly implicit) + available data (+ associated quality) + formulation of queries + computations (multiple, diluted responsibility): external quality (= contextual quality), “Stained Quality”.
  • 3. Early influencers: Henry Prade (Lotfi Zadeh’s PhD), on Possibility Theory and “membership” functions (µ); PhD, U. Toulouse, 1980. L. Zadeh described himself as “{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.” µ values, for instance, in some context: µ(Us)=.9, µ(m)=.7, µ(ee)=.8, µ(Iran)=.6, µ(Rus)=.4; in another one: .8, .8, .4, .8, .6.
  • 4. Early influencers: Henry Prade (Lotfi Zadeh’s PhD), Possibility Theory and membership functions. L. Zadeh described himself as “{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.” µ values, for instance, in some context (flag): µ(Us)=.9, µ(m)=.7, µ(ee)=.8, µ(Iran)=.6, µ(Rus)=.4; in another one: .8, .8, .4, .8, .6.
  • 5. Early influencers: Henry Prade (Lotfi Zadeh’s PhD), Possibility Theory and membership functions. L. Zadeh described himself as “{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.” µ values, for instance, in some context: µ(Us)=.9, µ(m)=.6, µ(ee)=.8, µ(Iran)=.5, µ(Rus)=.4; in another context: .6, .8, .4, .8, .6.
  • 6. Early influencers: Henry Prade and Lotfi Zadeh (Possibility Theory and membership functions); Pete Fisher, first met in Luxembourg, 1990: conversations on “activating quality”, … which we put into practice …
  • 7. Early influencers: Henry Prade (Lotfi Zadeh’s PhD): “membership functions”, Toulouse, 1980. Pete Fisher: “activating quality”, Brussels, 1990. … we did “activate quality” in the REVIGIS project, 1999-2004, at ITC
  • 8. Recent influences. A journey to the country of data use, abuse and misuse: when a data scientist frequents the circle of politicians, diplomats and lawmakers, it could be a Mark Twain-like story … striving to comply with this rule: thou shalt say the (whole) truth and nothing but the truth …
  • 9. Recent influences. A journey to the country of data use, abuse and misuse … eventually finding that half is enough: thou shalt say the (whole) truth and nothing but the truth (*). (*) In King Arthur’s court, “whole truth” = as much as necessary, not more.
  • 10. Recent influences. Is “nothing but the truth” enough? Frequenting King Arthur’s court can help a data scientist deepen their thinking about needs and goals, facts and acts, quality and conviction, etc., and ask questions such as: • Quality in the facts versus quality in the acts? • Quality in the data versus fitting the queries? • Quality in the representation versus faith in the processing?
  • 11. Quality of one data. We are in the era of Big Data, but let’s start modestly, with only one data. • Consider this table (*) published by the NSF: French PhDs, 2007, staying in the USA after PhD: 69.7%. (*) Doctorate recipients with temporary visas intending to stay in the USA (excerpt from mandatory forms filled in by all PhD students when defending their thesis).
  • 12. Quality of one data. In 2009 a think tank used that figure, rounded to 70%, as the “key data” of its report (cf. website). The report got pretty good media coverage.
  • 13. Quality of one data. Where is the quality issue? Not in the accuracy. What does 70% imply? (Rising media interest!) Is it “a lot”? … Without a baseline, the answer is “yes” in general. But say “about 50% of newborns are boys”: everybody knows it’s not “a lot”, it’s just “usual”. Without more data, a single figure is useless for comparison, no matter its accuracy.
  • 14. Quality of one data. Investigating the NSF table for 2007 more deeply: all “visa” PhDs intending to stay = 74.3%. Much ado about nothing! Brain drain doesn’t threaten France! Report = garbage.
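
The baseline argument of slides 13 and 14 can be sketched as a two-line computation. This is only an illustration: the two percentages are the ones quoted on the slides, while the function name `compare_to_baseline` and its output fields are assumptions made for this sketch.

```python
# Figures quoted on the slides (NSF table, year 2007).
france_stay_rate = 69.7    # % of French "visa" PhDs intending to stay in the USA
all_visa_stay_rate = 74.3  # % of all "visa" PhDs intending to stay (the baseline)


def compare_to_baseline(value: float, baseline: float) -> dict:
    """Return the gap in percentage points and the ratio to the baseline.

    Hypothetical helper, introduced here only for illustration.
    """
    return {
        "gap_points": round(value - baseline, 1),
        "ratio": round(value / baseline, 3),
    }


result = compare_to_baseline(france_stay_rate, all_visa_stay_rate)
print(result)
```

A negative gap means the French figure sits below the all-countries figure, which is the point of slide 14: the single number only becomes informative once it is placed next to its baseline.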
  • 15. Quality of two data (evolution) • OECD PISA surveys (Programme for International Student Assessment) – Every 3 years since 2003 – Each survey is accompanied by an OECD report which focuses on so-called key points. • PISA 2012 survey: – The focus of the 2012 companion report is to enhance comparison with 2003 (publication: Dec. 3rd, 2012). – Impact in France: alert signal from the Ministry of Education = bad results!
  • 16. Quality of two data (evolution) • PISA surveys (Programme for International Student Assessment) – Every 3 years since 2003 • 2012 survey – Publication: Dec. 3rd, 2012 – Since September: alert signal from the Ministry of Education: bad results for France!
  • 17. Quality of two data (evolution) • PISA 2012, France: the score deteriorates in maths … while improving nicely in Germany. Alert from the Ministry! [Chart, 2 dates: scores (axis 450 to 520) in 2003 and 2012 for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
  • 18. Quality of two data (evolution) • Where is the quality issue? Now, look at the full dataset (intermediate dates 2006 and 2009). Alert from the Ministry! [Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for the same countries]
  • 19. Quality of two data (evolution) • Where is the quality issue? When it really happened: 6 years earlier! Alert = garbage. [Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for the same countries]
  • 20. Quality of two data (evolution) • Where is the quality issue? Not in the accuracy, not in the reliability (OECD data) … just in the laziness of the report writers and editors: all the data were in the tables, in open access. [Chart, 4 dates: scores (axis 450 to 520) in 2003, 2006, 2009 and 2012 for the same countries]
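
The difference between reading two dates and reading four can be sketched in a few lines. The scores below are made up for illustration and are not the OECD figures; the function names are likewise assumptions of this sketch. The idea is only that an endpoint comparison reports how much a series changed, while the intermediate surveys show when it changed.

```python
# Made-up survey scores, for illustration only (NOT the OECD PISA figures).
scores = {2003: 510, 2006: 495, 2009: 497, 2012: 495}


def endpoint_change(series: dict) -> int:
    """Change between the first and the last survey only (the '2 dates' reading)."""
    years = sorted(series)
    return series[years[-1]] - series[years[0]]


def largest_drop(series: dict) -> tuple:
    """Steepest decline between consecutive surveys (the '4 dates' reading).

    Returns (start_year, end_year, change).
    """
    years = sorted(series)
    steps = [(a, b, series[b] - series[a]) for a, b in zip(years, years[1:])]
    return min(steps, key=lambda step: step[2])


print(endpoint_change(scores))
print(largest_drop(scores))
```

With this made-up series, the whole endpoint change is already present in the first interval, which mirrors the "6 years earlier" remark on slide 19.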
  • 21. Intermediate conclusion • “Nothing but the truth” is not enough! Between 1 or 2 data and Big Data … it’s worth spending time reading a few more data. Beware: data carry a lot of implicit context: – origin, scale, – adjacent data: time intervals, – broader neighborhood: part-whole relationship – … it looks like a “mapping” process! Learn to read datasets as you learn to read maps, and read as much as possible. Quality in act is: being aware of the context
  • 23. A puzzling experience with a .com. Queries: ?: “rue de Miromesnil”, answer: pretty close to my real destination (as I would see later). ?: “metro nearby” (service option), answer: looks about 300 m from the target, seems OK. Unfortunately … station Havre-Caumartin is not there! … but about 1 km further east! Is my “.com” site really dumb?
  • 24. investigating further! • “rue de Miromesnil”: no street # (that’s the user query) • “métro”: no street # (why?) • An automatic query was generated to: ratp.fr (official metro website)
  • 25. What’s going on with ratp.fr… Same queries on the “ratp.fr” site: • “rue de Miromesnil” (without number): answers a different location, not in the middle but near the beginning of the street • “metro” nearby: station Miromesnil (it exists!) Different protocols: a street without a number is located near #1 by ratp.fr, and near the street middle by the .com
  • 27. Query chain in Big Data context. Answers vary depending on the way the query is built: 1. “.com” directly answers query “rue Miromesnil” 2. “.com” delegates “metro” query to ratp.fr (specialized) 3. “ratp” answers street address without # 4. “.com” assumes street middle (doesn’t check if it complies) 5. Proof: asking “boul. Hausmann” in step 1 gives you a perfect match (… which doesn’t fit needs). Separate queries are correctly handled, but the assumption of same protocols and ontologies is wrong! Quality in act in the Big Data context = an unmonitored chain of queries
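
The mismatch described in the query chain can be sketched with a toy one-dimensional model. Everything in it is hypothetical: the street length, the station positions and all function names are invented for illustration and do not describe the real services. It only shows how two individually correct answers disagree once one service's convention (street start) is silently combined with another's (street middle).

```python
from dataclasses import dataclass


@dataclass
class Street:
    name: str
    length_m: float  # hypothetical street length, in metres


# Hypothetical station positions along the street axis (metres from #1).
STATIONS = {"station_near_start": 50.0, "station_near_middle": 520.0}


def locate_start(street: Street) -> float:
    """Convention A: a street without a number is placed near #1."""
    return 0.0


def locate_middle(street: Street) -> float:
    """Convention B: a street without a number is placed at its midpoint."""
    return street.length_m / 2


def nearest_station(position_m: float) -> str:
    return min(STATIONS, key=lambda name: abs(STATIONS[name] - position_m))


def chained_query(street, display_locator, delegate_locator):
    """Display the street with one convention, delegate 'metro nearby' to another.

    Returns the station chosen by the delegate and its distance (metres)
    from the point actually shown to the user.
    """
    shown_at = display_locator(street)
    station = nearest_station(delegate_locator(street))
    return station, abs(STATIONS[station] - shown_at)


street = Street("example street", 1000.0)
print(chained_query(street, locate_middle, locate_start))   # mixed conventions
print(chained_query(street, locate_middle, locate_middle))  # same convention
```

Monitoring the chain would amount to checking the returned distance against what the user was led to expect, instead of assuming both services share the same protocol.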
  • 29. Intermediate conclusion (query) • Is my query specific enough, including the “implicit”? • Are the data specific enough for that query, and do they fit the “implicit”? • Data with the same specificity may not fit questions whose specificity differs. Examples • (1) Paris metro stations: – “rue”: broad (line), OK – “metro”: implicit point, not OK, because implicit assumptions differ • (2) Students brain drain: – “how many staying?”: OK – implicit “is it a lot?”: not OK, because there is no baseline. Quality in act is being (adequately) specific. Remember: “nothing but the truth” is not enough
  • 30. Data, Processing and Quality • Let’s consider the simplest and most popular processing ever: computing an average value with the arithmetic mean. The media are fond of the mean. To be fair, they know that the median may sometimes be better (but it is rarely published).
  • 31. What the (mean) mean means? The general feeling is to link the mean and the middle (50-50). • Let’s consider a situation modeled by a relation (e.g. pupils and classrooms): it can be inspected from two points of view. Is the mean the same from both points of view? • That’s what we expect (general feeling: the mean is independent of the viewpoint).
  • 32. What the (mean) mean means? • Ex. pupils in schools: – how many pupils per classroom? (average = mean or median) – both mean and median are accepted as meaningful (at least for somewhat “regular” data). Two points of view: from the classroom (lawmaker) and from the pupil (parents): Pc = Proba(classroom c, classSize(c) ≥ average), expected = 0.50; Pp = Proba(pupil p ∈ c, classSize(c) ≥ average), expected = 0.50. Question 1: does Pc = Pp?
  • 33. Easy stats #1: mean • UK government publishes annual arithmetic mean of the number of pupils per class: • 2010 average class size: mean = 27.8 stable from previous year • Number of pupils (primary): 3 850 000 The breakdown of classes by number of pupils is seldom surveyed. But UK introduced a new law in 2001.
  • 34. Easy stats #1: mean • Evolution of class size: Pc. Introduction of the limit of 30 per class in 2001 → special report providing Pp data!
  • 35. Easy stats #1: mean • Looking closer at the report, remember: mean = 27.8. A rapid reading of the graph: 65% in [28, +∞[ (41+13+8+2+a few more), hence Pp = 0.65, not 0.50. Question 2: is the median better?
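
The gap between Pc and Pp can be reproduced on a small example. The class sizes below are hypothetical, chosen only to illustrate the effect, and the function names are assumptions of this sketch. Pc counts classrooms at or above the mean size; Pp counts pupils sitting in such classrooms. Because large classes contain more pupils, Pp cannot be smaller than Pc.

```python
from statistics import mean

# Hypothetical class sizes, chosen only to illustrate the effect.
class_sizes = [12, 18, 24, 28, 29, 30, 30, 30]


def share_of_classes_at_or_above(sizes, threshold):
    """Pc: fraction of classrooms whose size is >= threshold."""
    return sum(1 for s in sizes if s >= threshold) / len(sizes)


def share_of_pupils_at_or_above(sizes, threshold):
    """Pp: fraction of pupils who sit in a classroom whose size is >= threshold."""
    return sum(s for s in sizes if s >= threshold) / sum(sizes)


average = mean(class_sizes)
pc = share_of_classes_at_or_above(class_sizes, average)
pp = share_of_pupils_at_or_above(class_sizes, average)
print(average, pc, pp)
```

The two viewpoints only coincide when all classes have the same size; any spread in class sizes pushes the pupils' view above the lawmaker's view.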
  • 36. Easy stats #1: mean • Answering questions 1&2 with my students … a JavaScript animation: allow 10% of all classes to send pupils to other classes, under 2 constraints: sizeMax ≤ 30 and sizeMin ≥ 10
  • 38. Intermediate conclusion (averages) (1) Relations (e.g. membership, part_of) are two-sided: different actors may have different visions of the same relation (membership feeling). (2) Go beyond single summary statistics (mean, median): question the non-regular form of the probability density function (pdf) and the possible existence of extreme values (outliers). Quality in act! Whenever possible: change the viewpoint, take a step back
  • 39. Less easy stats (extremes) • The issue of measuring extreme inequalities has been around for several decades (*). (*) e.g. Centre d’Analyse et Mathématiques Sociales (CAMS, EHESS, Paris). One more influencer: Marc Barbut, CAMS cofounder, introduced Pareto among French sociologists and was a leading voice in the measurement of inequalities.
  • 40. Less easy stats (extremes) • Change viewpoint, second example: the relation population-income. Inter-quantile intervals (equi-population) ↔ equidistant intervals of income (equi-revenue). • The most popular index for measuring inequalities is the Gini coefficient. • It can be computed equally for both viewpoints, population or income, but the visualization is quite different.
  • 41. Less easy stats (Gini) • Gini coefficient with 1 data: if the richest u % of the population (red) equally share f % of all income, G = f − u. • Gini coefficient with n data: G = [formula on the slide], where xi is the income shared by the i-th fraction of the population (with fractions of equi-population). Gini is so popular for measuring income inequalities that you can find “the Gini” of any country (without mention of income data, cf. Wikipedia). Gini can be applied to schools/pupils as well, or to any kind of resources.
  • 42. Less easy stats (Gini) • Where is the quality issue? What is not mentioned with the Gini: 1. its sensitivity to the scale of the fractions; 2. which fractions of the population are used: deciles? percentiles? (I couldn’t find it in OECD docs.) A simple demonstration: split a total income of 100. With the median alone: gini(5, 95) = 0.45; with the 3 quartiles: gini(2, 3, 3, 92) = 0.67; and with deciles: gini(1, 1, 1, 1, 1, 1, 1, 1, 92) = 0.81 – http://www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
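
The sensitivity of the Gini coefficient to the number of fractions can be checked with a short function. The formula used on the slide is not captured in this transcript, so the sketch below assumes one standard discrete form for equal-population groups, G = 2·Σ(i·x_i)/(n·Σx_i) − (n+1)/n with the x_i sorted in increasing order. That assumed form agrees with the figures quoted on the slide to within rounding (for example it gives 0.675 for the quartile split, where the slide shows 0.67).

```python
def gini(shares):
    """Gini coefficient for incomes held by equal-population groups.

    Assumed formula (a standard discrete form, not necessarily the slide's):
    G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n, with x sorted ascending.
    """
    xs = sorted(shares)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# The three splits of a total income of 100 listed on the slide.
splits = {
    "median alone": [5, 95],
    "3 quartiles": [2, 3, 3, 92],
    "deciles (as listed)": [1, 1, 1, 1, 1, 1, 1, 1, 92],
}
for label, shares in splits.items():
    print(label, gini(shares))
```

The same total income, cut into finer fractions, yields a larger coefficient, which is why a published Gini is hard to interpret without knowing which fractions were used.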
  • 43. Less easy stats (Gini) • And a real demonstration (French incomes 2007, after T. Piketty): equi-population quartiles, Gini = 0.4271; equi-population deciles, Gini = 0.4683; equi-population percentiles, Gini = 0.4812. [Chart labels: 99%, 1%]
  • 44. Less easy stats (Gini) • And now with equi-revenue brackets (GMI = guaranteed minimal income): the 99% paradigm starts at 5 x GMI … at 10 x GMI, Gini = 0.6072 … and counting … [Chart labels: 99%, 1%, line at 10 x GMI, “Second lab.”]
  • 45. Less easy stats (Gini) • And beyond: the 99.9% paradigm starts at 15 x GMI … and at 100 x GMI, Gini = 0.9406 … and counting … [Chart labels: 99.9%, 0.1%, lines at 10 x GMI and 100 x GMI]
  • 46. Less easy stats (Gini) • Quality issues: resampling. The most difficult part (controversial?) is to compute equal-size income brackets (equi-revenue) from percentiles of population. Aggregation: many-to-one. Disaggregation: one-to-many. NOIR arithmetic rules (Nominal-Ordinal-Interval-Ratio) differ depending on: - whether population is represented as equi-population (I) or not (O); - whether income is represented as amount (O), equi-revenue (I), or % share (R).
  • 47. Less easy stats (Gini) NOIR levels of measurement, arithmetic rules: – Nominal: simply naming variables. Ex. gender (Male, Female). – Ordinal: Nominal + rank. Ex. letter grades (A, B, C, D, F). Order relation is possible: min, max. – Interval: Ordinal + distance (equal, log-equal). We can compute sums, differences, means. – Ratio: Interval + zero. We can multiply/divide. Refs: • Stanley Smith Stevens (1946, psychology), revisited by several authors: N. Chrisman (1998, geography), A. Wolman (2006, Measurement in Conservation Science)
  • 48. Less easy stats (Gini) • Quality issue with re-sampling: the additional error is probably within the error margin of the primary data (declared income). Quality: some acceptable loss in accuracy. • Quality issue with removing outliers, or not: the highest incomes are generally merged with the highest decile or percentile. In that particular dataset, the highest percentile is disaggregated (anonymously: PSE work). And YES, a few hundred incomes are above 1000 times the GMI. Quality: additional brackets are estimated (real data are confidential).
  • 49. What’s an Outlier? “an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980) Some Outliers (www.forbes.com: top 20 fortunes)
  • 50. Intermediate conclusion (extremes) • Is it correct to change the viewpoint? What’s correct is to present both viewpoints. • Is it correct to transform data through aggregation-disaggregation? What’s correct is to inform about the quality loss • Is it correct to not ignore “extremes” as “outliers”? What’s correct is to inform about the quality gain when keeping them. Quality in act! Extremes are not statistical outliers
  • 51. Summing up the intermediates. Quality in act is: being aware of the context. Quality in act in the Big Data context: an unmonitored chain of queries. Quality in act is being specific. Quality in act: change viewpoint, take a step back. Quality in act: extremes are not necessarily statistical outliers. Quality in act: and what about scale?
  • 52. Yet another influencer (scale) • Andrew U. Frank (2009), “Why Is Scale an Effective Descriptor for Data Quality?” (L. Zadeh, “Some Reflections on Information Granulation and its Centrality in Granular Computing”, is cited by Andrew in that paper.) My apologies to my other influencers not cited in this presentation; many are collectively cited with the REVIGIS project.
  • 53. Scale and Quality • Brain drain example: part-whole relation matters [chart labels: China, Italy, Germany, France 70%, all countries 74%] • PISA example: the (Shannon-Nyquist) sampling theorem matters [chart labels: 2003, 2006, 2009, 2012]
  • 54. Scale and Quality • Query chain example: specificity matters (a metro station is a point) • Extremes example: the (Shannon-Nyquist) sampling theorem matters again: with equi-population percentiles, the 99% are over-sampled and the 1% under-sampled [chart labels: 99.9%, 0.1%, equi-revenue after re-sampling]
  • 55. Last Conclusion • Back to the initial question: does it fit your needs? • Do we really care about Quality? We care about needs first, then (in case of doubt) about quality. People act according to their needs: if quality fits trust, why go beyond? In other words: if the “nothing but the truth” is respected and needs are satisfied, why ask for the “whole truth”? (… a scientist’s obsession!)
  • 56. Very Last (personal) Conclusion • Being wrong is not enough! (to bring corrections) Because needs were fitted! • Being right is not enough! (to convince user in error) Because needs were fitted! The quality of the demonstration is the key: many examples: climate change! archaeology, art market, social indices, … Quality in act is Ability to Convince
  • 57. Merci for the Quality of your attention
  • 58. Annex: Open questions. • What metric best leverages the available data? – Does the user query require the metric to have particular properties (resistance to some factor, comparability across groups, etc.)? • Big data introduces new facets for quality: – queries can be sent to several services, – time lags can differ, and asynchronous processing can proceed in an unexpected sequence, – not all data are updated at the same time, – incomplete results can be returned, – redundancy is everywhere in Nature: it’s a protection against “outliers”. Big Data can be a source of redundancy. – Open data: remember, you never know what people will do with your data
  • 59. Yet another influencer • Darrell Huff (Published in 1954) Not necessarily intentional lies: – correlation and implicit causality, – goal-oriented visualization, – etc.