Does Data Quality lie in facts, or in acts?

A journey to the country of data use, abuse and misuse
Robert Jeansoulin, emeritus CNRS, Univ. Paris-Est

Workshop question
• Quality assessment of geospatial data: does it fit your needs?
“Needs” are the thread to follow in answering that question.
• Data producer: quality is part of the product (explicit responsibility to respect the producer’s own specifications).
→ internal quality, IQ (= product quality): Pristine Quality
• User: quality? Does the user know that they need it? It depends on the context: a purpose (possibly implicit) + available data (+ associated quality) + formulation of queries + computations (multiple, diluted responsibility).
→ external quality (= contextual quality): Stained Quality

Early influencers
Henry Prade (Lotfi Zadeh’s PhD), on Possibility Theory and “Membership” functions (µ)
PhD, U. Toulouse, 1980
L. Zadeh described himself as “{an American}, {mathematically oriented}, {electrical engineer} {of Iranian descent}, {born in Russia}.”
µ values, for instance …
… in some context: µ(Us)=.9, µ(m)=.7, µ(ee)=.8, µ(Iran)=.6, µ(Rus)=.4
… in another one: .8, .8, .4, .8, .6

Early influencers
Henry Prade (Lotfi Zadeh’s PhD)
µ values in one context: µ(Us)=.9, µ(m)=.6, µ(ee)=.8, µ(Iran)=.5, µ(Rus)=.4
… in another context: .6, .8, .4, .8, .6

Early influencers
Pete Fisher (first met in Luxembourg, 1990):
conversations on “Activating quality”,
… which we put in practice …
[Photos: Henry Prade, Lotfi Zadeh]

Early influencers
Henry Prade (Lotfi Zadeh’s PhD):
“membership functions”
Toulouse, 1980
Brussels, 1990
Pete Fisher:
“activating quality”.
… we did “activate quality” in the REVIGIS project 1999-2004
at ITC

Recent influences
A journey to the country of data use, abuse and misuse
When a data scientist frequents the circle of politicians, diplomats and lawmakers, it could be a Mark Twain-like story …
… striving to comply with this rule:
thou shalt say the (whole) truth and nothing but the truth

… finding eventually that half is enough:
thou shalt say the (whole) truth and nothing but the truth
(*) In King Arthur’s court, “whole truth” = as much as necessary, not more

Recent influences
Is “nothing but the truth” enough?
Frequenting King Arthur’s court can help a data scientist deepen their thoughts about needs and goals, facts and acts, quality and conviction, etc., and ask questions such as …
• Quality in the facts versus Quality in the acts?
• Quality in the data versus Fitting the queries?
• Quality in the representation versus Faith in the processing?

Quality of one data point
We are in the era of Big Data, but let’s start modestly, with only one data point.
• Consider this table* published by the NSF:
French PhD 2007, staying in the USA after PhD: 69.7 %
* Doctorate recipients with temporary visas intending to stay in the USA (excerpt from mandatory forms filled in by all PhD students when defending their thesis)

Quality of one data point
In 2009 a think tank used that figure, rounded to 70%, as the “key data” of its report (cf. website).
The report got pretty good media coverage.

Quality of one data point
Where is the quality issue? Not in the accuracy.
What does 70% imply? (rising media interest!)
Is it “a lot”? Without a baseline, the general answer is “yes”.
But take “about 50% of newborns are boys”: everybody knows it’s not “a lot”, it’s just “usual”.
Without more data, a single figure is useless for a comparison, no matter how accurate it is.

Quality of one data point
Investigating the NSF table for 2007 more deeply:
all “visa” PhDs intending to stay = 74.3 %
Much ado about nothing! Brain drain doesn’t threaten France!
Report = garbage
Quality of two data points (evolution)
• OECD PISA surveys (Programme for International Student Assessment)
– Every 3 years since 2003
– Each survey is accompanied by an OECD report which focuses on so-called key points.
• PISA 2012 survey:
– The focus of the 2012 companion report is to enhance comparison with 2003. (Publication: Dec. 3rd, 2012)
– Impact in France: since September, alert signal from the Ministry of Education = bad results for France!

[Chart: PISA scores (axis 450 to 520), 2 dates: 2003 and 2012, for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
• PISA 2012 France: score deteriorates in maths … while improving nicely in Germany
Alert! From Ministry
Now, look at the full dataset (intermediate dates 2006-2009)
[Chart: PISA scores (axis 450 to 520), 4 dates: 2003, 2006, 2009, 2012, for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
Alert! From Ministry
• Where is the quality issue?

[Chart: PISA scores (axis 450 to 520), 4 dates: 2003, 2006, 2009, 2012, for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
When it really happened: 6 years earlier!
Alert! From Ministry
alert=garbage

[Chart: PISA scores (axis 450 to 520), 4 dates: 2003, 2006, 2009, 2012, for Czech Rep., Iceland, Denmark, France, Sweden, Germany, Ireland]
Not in the accuracy,
not in the reliability (OECD data) …
just in the laziness of the report writers and editors:
all data were in the tables, in open access.
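Why two dates are not enough can be sketched in a few lines of Python. The scores below are hypothetical values chosen only to mimic the shape of the chart (they are not the OECD figures), and `largest_drop` is a name introduced for this illustration.

```python
def largest_drop(scores_by_year: dict[int, int]) -> tuple[int, int, int]:
    """Return (start_year, end_year, change) for the steepest fall
    between two consecutive surveyed dates."""
    years = sorted(scores_by_year)
    steps = [
        (first, second, scores_by_year[second] - scores_by_year[first])
        for first, second in zip(years, years[1:])
    ]
    return min(steps, key=lambda step: step[2])


# Hypothetical scores, for illustration only.
full_series = {2003: 510, 2006: 495, 2009: 497, 2012: 495}
two_dates = {year: full_series[year] for year in (2003, 2012)}

print("2 dates:", largest_drop(two_dates))    # the fall is spread over 2003-2012
print("4 dates:", largest_drop(full_series))  # the fall is located in 2003-2006
```

With only the end points, the fall can only be attributed to the whole 2003-2012 interval; the intermediate surveys show when it actually happened.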

Intermediate conclusion
• “Nothing but the truth” is not enough!
Between one or two data points and Big Data … it’s worth spending time reading a few more data.
Beware: data carry a lot of implicit context:
– origin, scale,
– adjacent data: time intervals,
– broader neighborhood: part-whole relationship
– … it looks like a “mapping” process!
Learn to read datasets as you learn to read maps, and read as much as possible.
Quality in act is: being aware of the context

a puzzling experience with a .com
Queries:
?: “rue de Miromesnil”
answer: pretty close to my real destination (as I would see later)
?: “metro nearby” (service option)
answer: looks about 300 m from target, seems OK.

unfortunately … station Havre-Caumartin is not there!
… but about 1 km further east!
Is my “.com” site really dumb?

investigating further!
• “rue de Miromesnil”: no street # (that’s the user query)
• “métro”: no street # (why?)
• An automatic query was generated to: ratp.fr (official metro website)

what’s going on with ratp.fr…
Same queries on the “ratp.fr” site:
• “rue de Miromesnil” (without number): answers a different location, not in the middle but near the beginning of the street
• “metro” nearby: station Miromesnil (it exists!)
Different protocols: a street without a number is located near #1 by ratp.fr, and near the middle of the street by the .com

Query chain in Big Data context
Answers vary depending on the way the query is built:
1. “.com” directly answers query “rue Miromesnil”
2. “.com” delegates “metro” query to ratp.fr (specialized)
3. “ratp” answers street address without #
4. “.com” assumes street middle (doesn’t check if it complies)
5. Proof: ask “boul. Haussmann” in step 1, gives you a perfect match (… which doesn’t fit needs)
[Diagram: steps 1 to 4 between the “.com” and ratp]
Separate queries are correctly handled, but:
the assumption of same protocols and ontologies is wrong!
Quality in act in the
Big Data context =
an unmonitored chain of queries
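The mismatch between the two protocols can be sketched in Python. Everything here is hypothetical: the street is reduced to a 1-D segment, the positions and station names are invented, and the two locating functions merely stand for the two conventions described above (near #1 versus street middle); neither is the actual behaviour of any real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Street:
    """A street reduced to a 1-D segment; positions in metres (hypothetical)."""
    name: str
    start: float  # position of street number 1
    end: float    # position of the last street number


def locate_near_first_number(street: Street) -> float:
    """Convention A: an unnumbered street is located at its beginning."""
    return street.start


def locate_at_midpoint(street: Street) -> float:
    """Convention B: an unnumbered street is located at its middle."""
    return (street.start + street.end) / 2


def nearest_station(position: float, stations: dict[str, float]) -> str:
    """Name of the station closest to a position on the same axis."""
    return min(stations, key=lambda name: abs(stations[name] - position))


street = Street("unnumbered street", start=0.0, end=1000.0)
stations = {"Station A": 100.0, "Station B": 600.0}  # invented positions

shown_on_map = locate_at_midpoint(street)  # what the portal displays
# The portal delegates the station query to a service using convention A.
delegated = nearest_station(locate_near_first_number(street), stations)
# What a single, consistent convention would have answered.
consistent = nearest_station(shown_on_map, stations)
gap = abs(stations[delegated] - shown_on_map)

print(f"delegated answer: {delegated}, consistent answer: {consistent}")
print(f"distance between displayed point and delegated station: {gap:.0f} m")
```

Each function is correct on its own; the error only appears when their outputs are chained without checking that both sides share the same convention.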

Intermediate conclusion (query)
• Is my query specific
enough? including the
“implicit”?
• Are data specific enough
for that query? and does
it fit the “implicit”?
• Data with same specificity
may not fit questions
whose specificity differs.
Examples
• (1) Paris metro stations:
– “rue”: broad (line): OK
– “metro”: implicit point: not OK, because implicit assumptions differ
• (2) Students brain drain:
– “how many staying?”: OK
– implicit: “is it a lot?”: not OK, because no baseline

Quality in act is being (adequately) specific
remember: “nothing but the truth” is not enough

Data, Processing and Quality
• Let’s consider the simplest and most popular processing ever: computing an average value with the arithmetic mean.
Media are fond of the mean.
To be fair, they know that the median may sometimes be better (but it is rarely published).

What does the (mean) mean mean?
The general feeling is to link the mean and the middle (50-50).
• Let’s consider a situation modeled by a relation (e.g. pupils and classrooms): it can be inspected from two points of view. Is the mean the same from both points of view?
• That’s what we expect (general feeling: the mean is independent of the viewpoint).

What does the (mean) mean mean?
• Ex. pupils in schools:
– how many pupils per classroom? (average = mean or median)
– both mean and median are accepted as meaningful (at least for somewhat “regular” data)
Two points of view: from the classroom (lawmaker), and from the pupil (parents):
Pc = Proba(classroom c, classSize(c) ≥ average), expected = 0.50
Pp = Proba(pupil p ∈ c, classSize(c) ≥ average), expected = 0.50
Question 1: does Pc = Pp?
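Question 1 can be checked on any list of class sizes. A minimal Python sketch, assuming the “average” is the arithmetic mean; the five class sizes are hypothetical and `class_size_views` is a name introduced for this illustration.

```python
def class_size_views(sizes: list[int]) -> tuple[float, float, float]:
    """Return (mean, Pc, Pp) for a list of class sizes.

    Pc: share of classrooms whose size is at least the mean.
    Pp: share of pupils sitting in such classrooms.
    """
    n_classes = len(sizes)
    n_pupils = sum(sizes)
    mean = n_pupils / n_classes
    big_classes = [size for size in sizes if size >= mean]
    pc = len(big_classes) / n_classes
    pp = sum(big_classes) / n_pupils
    return mean, pc, pp


sizes = [12, 18, 25, 30, 30]  # hypothetical classes, for illustration only
mean, pc, pp = class_size_views(sizes)
print(f"mean = {mean:.1f}, Pc = {pc:.2f}, Pp = {pp:.2f}")
```

Large classes hold more pupils each, so the pupil's view (Pp) can never be lower than the classroom's view (Pc); the two are equal only when all classes have the same size.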

Easy stats #1: mean
• The UK government publishes the annual arithmetic mean of the number of pupils per class:
• 2010 average class size: mean = 27.8, stable from the previous year
• Number of pupils (primary): 3 850 000
The breakdown of classes by number of pupils is seldom surveyed.
But the UK introduced a new law in 2001.

Easy stats #1: mean
• Evolution of class size: Pc
Introduction of the limit of 30 per class in 2001 → special report providing Pp data!

Easy stats #1: mean
• Looking closer at the report
remember: mean = 27.8
rapid reading of the graph: 65% in [28, +∞) (41+13+8+2+a few more)
hence: Pp = 0.65, not 0.50
Question 2: Is the median better?

Easy stats #1: mean
• Answering questions 1&2
with my students …
a JavaScript animation:
allow 10% of all classes to send pupils to other
classes, under 2 constraints:
sizeMax ≤ 30 and sizeMin ≥ 10
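The original animation was written in JavaScript and is not reproduced here. The following is a rough Python sketch of the same idea under stated assumptions: 100 classes starting at 28 pupils each (hypothetical), 10% of classes picked as donors in every round, each donor sending one pupil to a random other class, and the two size constraints enforced. The function names and the one-pupil-per-round rule are assumptions made for this illustration.

```python
import random


def redistribute(sizes, rounds, share=0.10, size_min=10, size_max=30, seed=0):
    """Let a share of classes send one pupil each to a random other class,
    keeping every class within [size_min, size_max]."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sizes = list(sizes)
    n_classes = len(sizes)
    n_donors = max(1, round(share * n_classes))
    for _ in range(rounds):
        for donor in rng.sample(range(n_classes), n_donors):
            receiver = rng.randrange(n_classes)
            if (receiver != donor
                    and sizes[donor] > size_min
                    and sizes[receiver] < size_max):
                sizes[donor] -= 1
                sizes[receiver] += 1
    return sizes


def pc_and_pp(sizes):
    """Share of classes (Pc) and of pupils (Pp) in classes at or above the mean."""
    mean = sum(sizes) / len(sizes)
    big = [size for size in sizes if size >= mean]
    return len(big) / len(sizes), sum(big) / sum(sizes)


start = [28] * 100  # hypothetical starting point: all classes equal
end = redistribute(start, rounds=200)
pc, pp = pc_and_pp(end)
print(f"mean unchanged: {sum(end) / len(end):.1f}, Pc = {pc:.2f}, Pp = {pp:.2f}")
```

The mean stays fixed because pupils are only moved, never added or removed, while Pc and Pp drift apart as soon as the sizes stop being identical.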

Intermediate conclusion (averages)
(1) Relations (e.g. membership, part_of) are two-sided: different actors may have different visions of the same relation (membership feeling).
(2) Go beyond single summary statistics (mean, median): question the non-regular form of the probability density function (pdf) and the existence of possible extreme values (outliers).
Quality in act!
Whenever possible: change the viewpoint, take a step back

Less easy stats (extremes)
• The issue of measuring extreme inequalities has been around for several decades*.
(*) e.g. Centre d’Analyse et Mathématiques Sociales (CAMS, EHESS, Paris)
One more influencer
Marc Barbut, CAMS cofounder, introduced Pareto among French sociologists, and was a leading voice in the measurement of inequalities.

Less easy stats (extremes)
• Change of viewpoint, second example: the relation population-income
inter-quantile intervals (equi-population) ↔ equidistant intervals of income (equi-revenue).
• The most popular index for measuring inequalities is the Gini coefficient.
• It can be computed equally for both viewpoints, population or income, but the visualization is quite different.

Less easy stats (Gini)
• Gini coefficient with 1 data:
If the richest u % of the population (red) equally share f % of all income,
G = f − u.
• Gini coefficient with n data:
G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n \sum_{i=1}^{n} x_i}
where x_i is the income shared by the i-th fraction of the population (with fractions of equi-population).
Gini is so popular for measuring income inequalities that you can find “the Gini” of any country (without mention of income data, cf. Wikipedia).
Gini can be applied to school/pupils as well, or any kind of resources.

What is not mentioned with the Gini:
1. its sensitivity to the scale of the fractions.
2. which fractions of the population are used: deciles? percentiles? (I couldn't find it in OECD docs).
A simple demonstration: split a total income of 100
with the median alone: gini(5, 95) = 0.45
with the 3 quartiles: gini(2, 3, 3, 92) = 0.67
and deciles: gini(1, 1, 1, 1, 1, 1, 1, 1, 92) = 0.81
– http://www.peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator
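The three figures of the simple demonstration can be recomputed with a few lines of Python. This sketch assumes the usual mean-absolute-difference form of the Gini coefficient applied to equi-population fractions; it is not the code of the online calculator cited above, and the function name `gini` is chosen here for illustration.

```python
def gini(shares):
    """Gini coefficient of income shares held by equally sized population
    fractions, using the mean-absolute-difference definition."""
    n = len(shares)
    total = sum(shares)
    abs_differences = sum(abs(a - b) for a in shares for b in shares)
    return abs_differences / (2 * n * total)


examples = {
    "median alone": [5, 95],
    "3 quartiles": [2, 3, 3, 92],
    "finer split (as listed)": [1, 1, 1, 1, 1, 1, 1, 1, 92],
}
for label, shares in examples.items():
    print(f"{label}: G = {gini(shares):.3f}")
```

The same dominant share of the total income yields a higher coefficient each time the population is cut into finer fractions, which is the sensitivity to scale pointed out in item 1.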

• and a real demonstration (French incomes 2007, after T. Piketty)
equi-population quartiles, Gini = 0.4271
equi-population deciles, Gini = 0.4683
equi-population percentiles, Gini = 0.4812
[Chart labels: 99%, 1%]

• and now with equi-revenue brackets (GMI = guaranteed minimal income)
the 99% paradigm: starts at 5 x GMI …
at 10 x GMI, Gini = 0.6072 … and counting …
[Chart labels: 99%, 1%; line at 10 x GMI]

• and beyond
the 99.9% paradigm: starts at 15 x GMI …
and at 100 x GMI, Gini = 0.9406 … and counting …
[Chart labels: 99.9%, 0.1%; lines at 10 x GMI and 100 x GMI]

• Quality issues: Resampling.
The most difficult (and controversial?) part is to compute equal-size income brackets (equi-revenue) from percentiles of the population:
Aggregation: many-to-one
Disaggregation: one-to-many
NOIR arithmetic rules (Nominal-Ordinal-Interval-Ratio) differ depending on:
- whether population is represented as equi-population (I), or not (O)
- whether income is represented as amount (O), equi-revenue (I), or % share (R)
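The many-to-one direction is the easy one and can be sketched in Python. The eight income shares below are hypothetical, the Gini is computed with the usual mean-absolute-difference form (an assumption), and `aggregate` is a name introduced for this illustration. Disaggregation (one-to-many) is not shown because it needs a model of how income is spread inside each bracket.

```python
def gini(shares):
    """Gini coefficient of income shares held by equally sized fractions."""
    n = len(shares)
    abs_differences = sum(abs(a - b) for a in shares for b in shares)
    return abs_differences / (2 * n * sum(shares))


def aggregate(shares, group):
    """Many-to-one: merge each run of `group` adjacent fractions into one."""
    if len(shares) % group != 0:
        raise ValueError("number of fractions must be a multiple of group")
    return [sum(shares[i:i + group]) for i in range(0, len(shares), group)]


eighths = [1, 1, 2, 3, 5, 8, 13, 67]  # hypothetical shares of a total of 100
quarters = aggregate(eighths, 2)
halves = aggregate(eighths, 4)

for label, shares in [("8 fractions", eighths),
                      ("4 fractions", quarters),
                      ("2 fractions", halves)]:
    print(f"{label}: shares = {shares}, G = {gini(shares):.4f}")
```

Each aggregation step hides the inequality inside the merged fractions, so the coefficient can only go down; that is the quality loss a reader should be told about.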

NOIR levels of measurement, arithmetic rules:
– Nominal: Simply naming variables. Ex. gender (Male, Female)
– Ordinal: Nominal + rank. Ex. letter grades (A, B, C, D, F). Order relation is possible: min, max.
– Interval: Ordinal + distance (equal, log-equal). We can compute sums, differences, means.
– Ratio: Interval + zero. We can multiply/divide.
Refs:
• Stanley Smith Stevens (1946, psychology), revisited by several authors: N. Chrisman (1998, geography), A. Wolman (2006, Measurement in Conservation Science)

• Quality issue with re-sampling: the additional error is probably within the error margin of the primary data (declared income).
• Quality issue with removing outliers, or not: the highest incomes are generally merged with the highest decile or percentile. In that particular dataset, the highest percentile is disaggregated (anonymously: PSE work). And YES, a few hundred incomes are above 1000 times the GMI.
Quality: some acceptable loss in accuracy
Quality: additional brackets are estimated (real data are confidential)

What’s an Outlier?
“an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” (Hawkins, 1980)
Some Outliers (www.forbes.com: top 20 fortunes)

Intermediate conclusion (extremes)
• Is it correct to change the viewpoint?
What’s correct is to present both viewpoints.
• Is it correct to transform data through aggregation-disaggregation?
What’s correct is to inform about the quality loss.
• Is it correct not to dismiss “extremes” as “outliers”?
What’s correct is to inform about the quality gain when keeping them.
Quality in act!
Extremes are not statistical outliers

Summing up the intermediates
Quality in act is: being aware of the context
Quality in act in the Big Data context: an unmonitored chain of queries
Quality in act is being specific
Quality in act! Change viewpoint, take a step back
Quality in act: extremes are not necessarily statistical outliers
Quality in act: and what about Scale?

Yet another influencer (scale)
• Andrew U. Frank (2009): “Why Is Scale an Effective Descriptor for Data Quality?”
(L. Zadeh, “Some Reflections on Information Granulation and its Centrality in Granular Computing”, is cited by Andrew in that paper.)
My apologies to my other influencers not cited in this presentation; many are collectively cited with the REVIGIS project.

Scale and Quality
• Brain drain example: Part-whole relation matters
[Figure labels: China, Italy, Germany, France 70%, all countries 74%]
• PISA example: (Shannon-Nyquist) sampling theorem matters
[Figure labels: 2003, 2006, 2009, 2012]

Scale and Quality
• Query chain example: Specificity matters
Metro station is a point
• Extremes example: (Shannon-Nyquist) sampling theorem matters again
Equi-population percentiles: 99% over-sampled, 1% under-sampled
Equi-revenue after re-sampling: 99.9%, 0.1%

Last Conclusion
• Back to the initial question: does it fit your needs?
• Do we really care about Quality?
We do care about Needs, then (in case of doubt) about Quality.
People act according to their needs:
if quality fits trust, why go beyond?
In other words:
if the “nothing but the truth” is respected,
and needs are satisfied,
why ask for the “whole truth”?
(… a scientist’s obsession!)

Very Last (personal) Conclusion
• Being wrong is not enough! (to bring corrections)
Because needs were fitted!
• Being right is not enough! (to convince user in error)
Because needs were fitted!
The quality of the demonstration is the key.
Many examples: climate change, archaeology, art market, social indices, …
Quality in act is Ability to Convince

Thank you for the Quality of your attention

Annex: Open questions.
• What metric best leverages the available data?
– Does the user query require the metric to have particular properties (resistance to some factor, comparability across groups, etc.)?
• Big data introduces new facets for quality:
– queries can be sent to several services,
– time lags can differ, and asynchronous processing can proceed in an unexpected sequence,
– not all data are updated at the same time,
– incomplete results can be returned,
– redundancy is everywhere in Nature: it’s a protection against “outliers”. Big Data can be a source of redundancy.
– Open data: remember, you never know what people will do with your data

Yet another influencer
• Darrell Huff, “How to Lie with Statistics” (published in 1954)
Not necessarily intentional lies:
– correlation and implicit causality,
– goal-oriented visualization,
– etc.
