Quantifying the Value of Federated Datasets in Earth Observation Information Mining and Analytics
The volume of EO data systematically collected, processed and stored is continuously increasing
It becomes more and more difficult to evaluate their “value”
Datasets are made available by different institutions (agencies, commercial providers, etc.), resulting in a federation of EO datasets
The Dataset Value is a vague concept: inherent information content, its possible exploitation, relation with user’s application needs, etc.
How to evaluate the value of an EO dataset in a typical scenario of a network of federated datasets?
Our work shows that the Representation Capacity of an EO product dataset D is proportional to the log of the cardinality of the EO dataset.
The value of a federation of datasets should take into account the Representation Capacity, and therefore grows with the log of the size of the individual datasets.
In order to evaluate and compare different datasets in a federation for further processing, a general methodology to preserve the relative information content has been defined.
New models for research and service support are emerging in the Earth Observation context / Data availability from forthcoming missions will increase rapidly
Facilities for EO dissemination and processing services, geographically distributed in a federated domain, largely scalable with reliable Quality of Services are urgently needed
Federated domains shall federate both computing and storage resources. The federation is valued and sustained by the underpinning Earth Observation datasets and their information content
To value datasets federations in wider contexts (e.g. Big Data, Web 2.0) R&D activities are needed to fully exploit the information they contain
A programmatic framework to sustain such R&D activities must be set up to cover the various aspects involved (Image Information Mining, Time Series analysis, EO data analytics, multi-dimensional databases, semantic web, visual analytics, etc.)
Image Information Mining Conference: The Sentinels Era
1. Quantifying the Value of
Federated Datasets in Earth
Observation Information
Mining and Analytics
P.G. Marchetti (*), M. Iapaolo (**)
* European Space Agency (ESA/ESRIN)
EOP Research and Ground Segment Technology Section
** Randstad Italia c/o ESA/ESRIN
Image Information Mining Conference
05/03/2014
2. Outline
1. Introduction
2. EO Datasets Value
3. Representation Capacity and Information Content for EO Datasets
4. Initial Results
5. Towards “Big Data”
6. Future work and perspectives
3. Introduction
1. The volume of EO data systematically collected, processed and stored
is continuously increasing
2. It becomes more and more difficult to evaluate their “value”
3. Datasets are made available by different institutions
(agencies, commercial providers, etc.), resulting in a federation of EO datasets
4. The Dataset Value is a vague concept: inherent information
content, its possible exploitation, relation with user’s application
needs, etc.
How to evaluate the value of an EO dataset in a typical scenario
of a network of federated datasets?
4. EO Datasets Value
Communication networks: the value of a network (its growth
potential) grows as a quadratic function (n^2) of the number of
network nodes n (Metcalfe’s Law)
Generic concept of value (importance) applicable to a wide range of
natural phenomena (occurrences of words in a text, population size
of big cities, etc.): the k-th ranked item has a value (frequency, size) of
about 1/k of the first one (Zipf’s Law)
Total Value = sum of decreasing 1/k values over all the n items: $\sum_{k=1}^{n} 1/k \approx \log(n)$
Applying to all n nodes: Total Value $\approx n \log(n)$
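The log(n) approximation above can be checked numerically. The following is a minimal sketch, not part of the original presentation: it sums the Zipf values 1/k over n items and compares the result with the natural logarithm of n. The function name `zipf_total_value` is an assumption introduced here for illustration. The difference between the two settles near the Euler-Mascheroni constant (about 0.577), so the approximation holds up to an additive constant.

```python
import math

def zipf_total_value(n: int) -> float:
    """Sum of the Zipf values 1/k for k = 1..n (the n-th harmonic number)."""
    return sum(1.0 / k for k in range(1, n + 1))

# Compare the exact sum with ln(n) for growing n.
for n in (10, 100, 1000, 10000):
    total = zipf_total_value(n)
    print(f"n={n:>6}  sum(1/k)={total:.4f}  ln(n)={math.log(n):.4f}  "
          f"diff={total - math.log(n):.4f}")
```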
5. EO Datasets Value
Plot of the n log(n) growth function, compared
with the linear and quadratic ones
The origin is set at n=1.
The crossover point for Zipf’s law is reached at a larger n
than for Metcalfe’s law
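The three growth functions in the plot can be tabulated with a few lines of code. This is an illustrative sketch only: the helper `growth_table` and the chosen values of n are assumptions, and the natural logarithm is used because the slides do not state a base. With this choice n log(n) is 0 at n=1, overtakes the linear function once n exceeds e, and stays below n^2 throughout.

```python
import math

def growth_table(sizes):
    """Return (n, linear, n*log(n), quadratic) rows for each federation size n."""
    return [(n, n, n * math.log(n), n ** 2) for n in sizes]

# Print the three growth functions side by side.
for n, linear, nlogn, quadratic in growth_table([1, 2, 10, 100, 1000]):
    print(f"n={n:>5}  linear={linear:>5}  n*log(n)={nlogn:>8.1f}  n^2={quadratic:>8}")
```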
6. EO Datasets Value and
Information Content
1. In the EO context, it is of paramount importance to assess the value
of datasets from the information content point of view (rather than
from growth potential or market value)
2. The actual exploitation of federated datasets is mainly based on
their information content, extracted through time series analysis
and image information mining techniques and analytics
3. The relative value (i.e. the information content) of an EO dataset
makes it possible to:
o estimate the number of EO products (or samples) to be used
o select which datasets are relevant for an analysis
Need for a theoretical framework for the assessment of the
value (information content) of a federation of EO datasets
7. Representation Capacity
Given a family of n non-overlapping datasets in a
federation, D = {D_1, D_2, …, D_n};
Select from D a sample S = {S_1, S_2, …, S_n}, where each S_h is contained
in D_h (h = 1, 2, …, n);
Our aim here is to assess and quantify how representative S is
of D, and how it can characterise the value of D
The Representation Capacity of D, K(D), is a measure of the
degree of arbitrariness in choosing the sample S from D
K(D) should be a non-decreasing function f(x), where x is the size of
the set from which the images must be extracted
10. Representation Capacity
1. The Representation Capacity of an EO product dataset D is
proportional to the log of the cardinality of the EO dataset
2. The value of a federation of datasets should take into account the
Representation Capacity, and therefore grows with the log of the
size of the individual datasets
3. In order to evaluate and compare different datasets in a federation
for further processing, a general methodology to preserve the
relative information content has been defined
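To make the logarithmic dependence concrete, the sketch below computes a capacity for each dataset and the share of each dataset in the federation total. It is an illustration under stated assumptions, not the methodology defined by the authors: the proportionality constant is set to 1, base 2 is chosen arbitrarily, the federation total is taken as a plain sum of per-dataset capacities, and all function names are hypothetical.

```python
import math

def representation_capacity(cardinality: int, base: float = 2.0) -> float:
    """Capacity taken as log(cardinality); the constant factor is assumed to be 1."""
    if cardinality < 1:
        raise ValueError("cardinality must be >= 1")
    return math.log(cardinality, base)

def federation_value(cardinalities, base: float = 2.0) -> float:
    """Assumed aggregate: sum of the per-dataset capacities."""
    return sum(representation_capacity(c, base) for c in cardinalities)

def relative_weights(cardinalities, base: float = 2.0):
    """Share of each dataset in the federation total (hypothetical comparison aid)."""
    total = federation_value(cardinalities, base)
    if total == 0:
        # Every dataset holds a single product: no arbitrariness in the choice.
        return [0.0 for _ in cardinalities]
    return [representation_capacity(c, base) / total for c in cardinalities]

# Two hypothetical datasets with 1,000 and 1,000,000 products.
print(relative_weights([1_000, 1_000_000]))
```

In this example the second dataset is a thousand times larger than the first, yet its capacity is only twice as large, which is the practical consequence of logarithmic growth.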
11. Comments
1. Additional constraints could be imposed by further
processing, image mining, time series analysis and
statistics/analytics objectives and requirements
2. The simplified approach presented in this paper could make it possible to
assess the value (information content) of a federation of EO datasets
according to Shannon’s theoretical framework
3. This approach should complement the one derived from Zipf’s
law, based on the number n of datasets in the federation, to help
decision makers in evaluating the wealth of available information.
13. Initial results 1-3
1. General approach for the assessment of the value – in terms of
information content – of a federation of EO datasets
2. Interpretation of results under the Shannon information theoretical
framework:
o The information content of a dataset is proportional to its
cardinality
o Considering a sample of data extracted from the whole
dataset, the Representation Capacity of the dataset is
proportional to the log of its cardinality
o As a consequence, the value (information content) of a
federation of EO datasets grows with the log of the size of the
individual datasets
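The logarithmic dependence is consistent with a standard result of Shannon's framework: choosing one item uniformly at random among N equally likely items has entropy log N. The worked equation below is an illustration, not necessarily the derivation used by the authors. The symbols are assumptions introduced here: X_h denotes a single product drawn uniformly from dataset D_h, N_h is the cardinality of D_h, and the draws from different datasets are taken to be independent.

```latex
H(X_h) = -\sum_{i=1}^{N_h} \frac{1}{N_h}\,\log_2 \frac{1}{N_h} = \log_2 N_h
\qquad\Longrightarrow\qquad
H(X_1, \dots, X_n) = \sum_{h=1}^{n} \log_2 N_h
```

Under these assumptions the joint entropy over the federation is a sum of terms that each grow with the log of the size of one dataset.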
14. Initial results 2-3
Oops, if we have a look at the papers…
[Figure: number of papers published on IEEE, search performed on 14.02.2014; vertical axis from 0 to 3000]
ESA UNCLASSIFIED – For Official Use
15. Initial results 3-3
The identification of a general method for evaluating, comparing and
selecting different datasets cannot ignore other information elements
like:
• the papers published and their quality, content, relevance, citations
and impact factors, e.g. the h-index (see Hirsch [1])
• the papers published and related parameters: mission, sensor, area, …
• the web pages published (see PageRank [2])
• social media
• …
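As an example of one such element, the h-index cited above can be computed directly from a list of per-paper citation counts: it is the largest h such that at least h papers have at least h citations each. The sketch below is a minimal illustration; the function name `h_index` and the citation counts in the usage line are made up for the example and do not come from the presentation.

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

# Made-up citation counts for five papers related to one dataset.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

Applied to the papers that use a given mission or sensor, a score of this kind could serve as one input among several when comparing datasets.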
16. Future Work, towards “Big Data”
1. New models for research and service support are emerging in the Earth
Observation context / Data availability from forthcoming missions will
increase rapidly
2. Facilities for EO dissemination and processing services, geographically
distributed in a federated domain, largely scalable with reliable Quality
of Services are urgently needed
3. Federated domains shall federate both computing and storage
resources. The federation is valued and sustained by the underpinning
Earth Observation datasets and their information content
4. To value datasets federations in wider contexts (e.g. Big Data, Web 2.0)
R&D activities are needed to fully exploit the information they contain
5. A programmatic framework to sustain such R&D activities must be set
up to cover the various aspects involved (Image Information Mining,
Time Series analysis, EO data analytics, multi-dimensional databases,
semantic web, visual analytics, etc.)
17. Future Work, towards “Big Data”
1. The programmatic framework should span a time frame of 5-10 years
2. It should include a strong user validation step (possibly involving
hundreds of users and laboratories)
3. It should be extended to include other domains (not only EO!): Earth
and Space Science, Engineering, … (see the announced “Big Data from
Space” Conference)
4. Recent work (Mazzucato) demonstrates the benefits of funding large and
strongly supported research programmes (venture capital and the market
will follow, exploiting earlier consistent investments by state-funded
institutions)
5. Research on value-enhanced search for EO data may help in adding
value and is needed to exploit the great variety of data which will
be made available!
18. References
[1] J.E. Hirsch, An index to quantify an individual's scientific research
output, Proceedings of the National Academy of Sciences 102(46):16569-16572, 2005
[2] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation
Ranking: Bringing Order to the Web, Technical Report, Stanford
InfoLab, 1999