The document describes DataSTAT Hub, a tool developed by the Italian National Institute of Statistics (Istat) for the automatic collection of administrative data to produce official statistics. DataSTAT Hub uses REST architecture and HTTP to enable the automated collection, standardization, and integration of administrative data from various sources in a scalable way. It models data as key-value pairs to accommodate heterogeneous data types and uses Elasticsearch for indexing and searching large amounts of data. This allows for easier and more efficient collection and dissemination of administrative data compared to traditional methods.
1. A tool for the automatic collection of administrative data
to produce official statistics
Conference of European Statistics Stakeholders
Budapest, 20-21 October 2016
Alessandro Capezzuoli, Emanuela Recchini
2. 1 Official statistics and data integration
2 Technology
3 Model
4 Architecture
5 Concluding remarks
DataSTAT Hub: a tool for the automatic collection of administrative data to produce official statistics
Alessandro Capezzuoli, Emanuela Recchini – Conference of European Statistics Stakeholders, Budapest, 20-21 October 2016
3. 1. Official statistics and data integration
Bringing together information from different
sources makes it possible to fill information
gaps or provide insights which cannot be
gleaned from unlinked data and to improve
the knowledge and understanding of
specific phenomena.
Introductory remarks (1)
There is worldwide recognition of the
increasing role played by administrative data
in the production of more timely, more
disaggregated statistics at higher frequencies
than traditional survey data.
The efficient use of all available information
to produce timely, accurate and high quality
statistics is a challenge for National Statistical
Offices (NSOs), which are increasingly
committed to developing methods and
suitable tools for the production, collection,
standardization and integration of different
types of statistical data.
4. Nowadays, the exploitation of administrative data for statistical purposes is a normal
practice for a large number of NSOs. This improves the quality of statistical outputs, reduces
the statistical burden on respondents and minimizes costs.
The Italian National Institute of Statistics (Istat) collects and manages large amounts of
administrative data from different sources, including:
• Italian Agency of Revenue
• Bank of Italy
• Ministries
• Social Security Institutions
• Government Institutions
• Private Institutions
• …
From 2009 to 2015, administrative data supplied to Istat have trebled
1. Official statistics and data integration
Introductory remarks (2)
5. According to the provisions of the Italian Digital Administration Code:
➢ before proceeding to the collection of new data, public administrations are required
to verify whether the information they need can be acquired through access to
information already held by other public authorities or public bodies;
➢ the technical options for the usability of data are:
• web access through the website of the supplier institution or an ad hoc thematic
website
• interoperability among public administrations for data collection and data
integration
➢ the user can process the data collected exclusively for the pursuit of its institutional
goals; transferring data from one information system to another does not change data
ownership.
1. Official statistics and data integration
The Italian legislation on data collection
(Guidelines for the drafting of conventions on the usability of Public Administrations' data; Legislative Decree n. 82/2005,
commonly referred to as the “Digital Administration Code”, as amended by Legislative Decree n. 235/2010)
6. 1. Official statistics and data integration
Administrative data collected by Istat
Data collected by Istat are very different from each other in type, content and structure
7. DATA SUPPLIER
- receives data requests
- processes data requests
- prepares data to be sent
- sends data to data collector
DATA COLLECTOR
- manages data requests
- defines methods and standards
- manages reminders
- stores data and metadata
- standardizes and disseminates data
1. Official statistics and data integration
Data collection process (1)
8. 1. Official statistics and data integration
Data collection process (2)
✓ Data collection through File Transfer Protocol
(FTP)
✓ Data uploading through an ad hoc website to
manage reminders and data supply requests
THESE SOLUTIONS DO NOT PERMIT PROCESS AUTOMATION
✓ Management of data requests and reminders
✓ Complex IT infrastructure
✓ Burden for data suppliers
✓ Human resources for transactions management
9. 2. Technology
…the World Wide Web offers a possible solution!
HTTP (Hypertext Transfer Protocol), the set of rules for transferring files on the Web, can be
conveniently used for data collection and data exchange.
It is a request/response protocol based on the client-server architecture.
Representational State Transfer (REST)
• is not a standard, but an architectural style for designing networked applications
• defines a set of guidelines for using HTTP to perform the four operations summarized in the acronym
CRUD (Create, Read, Update, Delete) by means of an API (Application Programming Interface).
10. CRUD principles
REST is a service concept that may be summarized by the CRUD principles
REST allows data suppliers
to create, read and update
resources with a logic
similar to that used to
perform operations on any
SQL database.
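The correspondence between CRUD operations, HTTP and SQL can be sketched as follows. This is a minimal illustration, not the actual DataSTAT Hub interface: pairing Create/Read/Update/Delete with POST/GET/PUT/DELETE is the usual REST convention, and the URL and payload are hypothetical. The request is only built, never sent.

```python
import json
from urllib.request import Request

# Conventional pairing of CRUD operations with HTTP methods and SQL statements.
CRUD = {
    "create": ("POST", "INSERT"),
    "read": ("GET", "SELECT"),
    "update": ("PUT", "UPDATE"),
    "delete": ("DELETE", "DELETE"),
}


def build_request(operation, url, payload=None):
    """Build (without sending) the HTTP request for a CRUD operation."""
    method, _sql_equivalent = CRUD[operation]
    data = None
    headers = {}
    if payload is not None:
        # JSON payloads travel in the request body as UTF-8 bytes.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# Hypothetical resource URL and payload, for illustration only.
request = build_request(
    "create",
    "http://example.org/api/datasets/enterprises",
    {"year": 2015, "region": "Lazio"},
)
print(request.get_method(), request.full_url)
```

A data supplier would write such a request once and reuse it for every supply, changing only the payload.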
2. Technology
11. The REST architecture separates the relational database from
the client by means of an API, which uses HTTP to transmit data
and exchange information.
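A minimal sketch of this separation, under stated assumptions: the class `ResourceAPI`, its `handle` method and the table layout are hypothetical, and an in-memory SQLite database stands in for the relational database. The client only issues HTTP-style calls; the SQL stays behind the API.

```python
import json
import sqlite3


class ResourceAPI:
    """Hypothetical API layer: clients send HTTP-style calls, never SQL."""

    def __init__(self):
        # In-memory relational database, hidden behind the API.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE resources (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def handle(self, method, resource_id, body=None):
        """Translate an HTTP-style call into SQL; return (status, payload)."""
        if method == "GET":
            row = self._db.execute(
                "SELECT body FROM resources WHERE id = ?", (resource_id,)
            ).fetchone()
            return (200, json.loads(row[0])) if row else (404, None)
        if method == "PUT":
            # Create the resource, or replace it if it already exists.
            self._db.execute(
                "INSERT OR REPLACE INTO resources (id, body) VALUES (?, ?)",
                (resource_id, json.dumps(body)),
            )
            return (200, body)
        if method == "DELETE":
            cursor = self._db.execute(
                "DELETE FROM resources WHERE id = ?", (resource_id,)
            )
            return (204, None) if cursor.rowcount else (404, None)
        return (405, None)  # method not allowed


api = ResourceAPI()
api.handle("PUT", "supply-2015", {"records": 120})
print(api.handle("GET", "supply-2015"))
```

Because the client depends only on the method and the resource identifier, the database behind the API could be replaced without changing the supplier's code.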
2. Technology
12. 3. Model
The different types of data, IT tools and skills of data suppliers require a model implying:
UNSTRUCTURED DATA - a model collecting data in their essence (key/value) is more convenient
and immediate than defining multiple standards for data representation;
SCALABILITY - a highly extensible architecture is needed to allow for possible future conceptual or
architectural upgrades;
INTUITIVE SCHEMA - the model should be easy for data suppliers to apply, without resorting to
complex studies of any imposed standard;
BIG-DATA-ORIENTED ARCHITECTURE - the system should be in line with big-data processing
techniques;
INTEGRATION WITH MODERN IT TOOLS FOR BIG DATA - storage is closely linked to the tools
used for semantic search, data analysis and data visualization. Elasticsearch, Hadoop, Solr and Cassandra
provide a complete integrated environment for managing them.
13. KEY/VALUE storage model
{
"keyspace" :
{
"columnfamily" :
{
"rowkey" :
{
"supercolumn" :
{
"column name" : "column value"
}
}
}
}
}
Statistical Key Value
Data Model
3. Model
The format best suited for HTTP use is JSON (JavaScript Object Notation), with which
different models for data representation can be associated. In particular, when dealing
with highly heterogeneous data, it is recommended to use a model that represents them
in their simplest form: a key/value pair.
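The nested key/value model shown on this slide can be sketched with plain dictionaries. This is an illustration only: the helper names `put` and `flatten` and all sample keys and values are hypothetical, and only the five nesting levels (keyspace, column family, row key, supercolumn, column) come from the slide.

```python
def put(store, keyspace, column_family, row_key, super_column, column, value):
    """Store one value under the five nested keys of the model."""
    (
        store.setdefault(keyspace, {})
        .setdefault(column_family, {})
        .setdefault(row_key, {})
        .setdefault(super_column, {})
    )[column] = value
    return store


def flatten(node, path=()):
    """Yield (path, value) pairs: every leaf is one key/value pair."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + (key,))
    else:
        yield path, node


store = {}
put(store, "istat", "enterprises", "IT0001", "address", "city", "Roma")
put(store, "istat", "enterprises", "IT0001", "address", "zip", "00184")
for path, value in flatten(store):
    print("/".join(path), "=", value)
```

Whatever the structure of the supplied data, it reduces to a list of paths and values, so heterogeneous sources need no dedicated schema.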
14. 4. Architecture
DataSTAT Hub is a tool for data collection that takes advantage of the potential
offered by HTTP/2 and the REST architecture and exploits the CRUD operations
(Create, Read, Update, Delete).
15. Most entities or objects in most applications can be serialized into a
JSON object, with keys and values. A key is the name of a field or
property, and a value can be a string, a number, a Boolean, another
object, an array of values, or some other specialized type such as a
string representing a date or an object representing a geolocation.
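As a small illustration of this serialization, the sketch below turns a hypothetical `Enterprise` entity into a JSON object covering each value type mentioned above. The entity, its fields and the sample values are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class Enterprise:  # hypothetical entity
    name: str  # string
    employees: int  # number
    active: bool  # Boolean
    location: dict  # object representing a geolocation
    activity_codes: list = field(default_factory=list)  # array of values
    registered: date = date(2015, 1, 1)  # serialized as a date string


def to_json(entity):
    """Serialize a dataclass instance into a JSON object of keys and values."""
    # default= is called for values json cannot encode natively (here: dates).
    return json.dumps(asdict(entity), default=lambda value: value.isoformat())


doc = to_json(
    Enterprise(
        "Example S.p.A.",
        42,
        True,
        {"lat": 41.9, "lon": 12.5},
        ["62.01"],
        date(2016, 10, 20),
    )
)
print(doc)
```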
Elasticsearch is an open source search engine that
can be conveniently used for the collection and
release of data. Through Elasticsearch it is
possible to index and map documents/data
by means of query strings sent via HTTP in JSON
format.
4. Architecture
Documents are indexed (stored and made searchable) by
using the index API, under an index, type and ID that together uniquely identify the document.
Mapping is the process of defining how a document,
and the fields it contains, are stored and indexed.
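The two operations can be sketched as HTTP requests carrying JSON bodies. This is a hedged illustration, not the DataSTAT Hub configuration: the host, index, type and field names are hypothetical, and the `/index/type/id` URL layout with a `mappings` section keyed by type follows the Elasticsearch versions current in 2016 (later versions removed types). The requests are built but never sent.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch node


def json_request(method, path, body):
    """Build (without sending) an HTTP request carrying a JSON body."""
    return Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Mapping: declares how the fields of a document type are stored and indexed.
create_index = json_request(
    "PUT",
    "/enterprises",
    {
        "mappings": {
            "enterprise": {
                "properties": {
                    "employees": {"type": "integer"},
                    "registered": {"type": "date"},
                    "location": {"type": "geo_point"},
                }
            }
        }
    },
)

# Indexing: the index, type and ID in the URL identify the document.
index_document = json_request(
    "PUT",
    "/enterprises/enterprise/IT0001",
    {
        "employees": 42,
        "registered": "2016-10-20",
        "location": {"lat": 41.9, "lon": 12.5},
    },
)

for request in (create_index, index_document):
    print(request.get_method(), request.full_url)
```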
DOCUMENT
INDEX / TYPE
MAPPING
16. ELASTICSEARCH
DATA SUPPLIER: data suppliers can autonomously create a data index, describe the data
content and perform any operation on the data (put/update/delete/get).
DATA STORAGE: data contained in the index can easily be stored in a database that uses
the key/value model (e.g. Cassandra).
OUTPUT CHANNEL: indexed data have an immediate dissemination channel, with Elasticsearch
acting as a powerful engine for searching big data and, possibly, an API that standardizes
the output.
4. Architecture
17. ELASTICSEARCH
4. Architecture
DATA SUPPLIER
SEARCH ENGINE
REST WEBSERVICES
WIDGET / USER INTERFACE
DataSTAT Hub applied to statistical classifications
www.statisticlass.eu
18. 5. Concluding remarks
DataSTAT Hub is a suitable and easy-to-use tool for the automated collection,
standardization and integration of administrative data.
Reduction of burden on users: the hub does not require knowledge of the
internal database, since updating is performed through HTTP query strings
and can be done with any programming language; once created, the procedure
can be reused for each subsequent data supply.
Reduction of costs in terms of the human resources employed in organizational,
bureaucratic and IT management.
By allowing us to overcome some critical issues related to the use of
administrative data, including those connected with privacy and security, a tool
such as DataSTAT Hub is time-saving and cost-effective.
It is a user-friendly tool developed by making use of open source technologies and
can be conveniently shared among NSOs, while it is extensible to any other
institution.