SlideShare a Scribd company logo
1 of 22
IT Architectures for Handling
Big Data in Official Statistics:
the Case of Scanner Data in Istat
Gianluca D’Amato, Annunziata Fiore,
Domenico Infante, Antonella Simone,
Giorgio Vinci, Antonino Virgillito
Istat
Scanner Data Workshop
Rome 2 October 2015
About me
• Head of unit «Architectures for business
intelligence, mobile and big data»
• Project manager of the UNECE project Big
Data in Official Statistics
• Leader of the IT group in the Istat Scanner
Data project
Abstract
• In this talk we present the issues and challenges
related to dealing with datasets of big size such as
those involved in the Scanner Data project at Istat
• We illustrate the IT architecture backing the testing
phase of the project, currently in place, and the ideas
for the production architecture
• The motivations behind the design are explained as
well as the solutions introduced as part of a larger
scope approach to the modernisation of tools and
techniques used for data storage and processing in
Istat, envisioning the future challenges posed by the
adoption of Big Data and Data Science in NSIs
Data Size - Current
Received and stored data from
6 chains in 6 provinces
30% of all stores available from Nielsen in these provinces
2 and a half years time span
Space occupation of 1 year of microdata in the database
426 Million records
18Gb
550 Stores
210,000 Products
Data Size - Expected
Further 29 provinces will be
received by the end of the year
Estimated size of 1 year of microdata
47Gb
142Gb
6 chains only
All available chains
Actual occupation
2 years of microdata only
36Gb
Current occupation of DB space
200Gb
100Gb
now
End 2015
29 provinces
6 chains
?
Indexes, views, aggregations,
classifications…
540Gb2016?
EVERYTHING!
The Problem with Size…
• Query time is not satisfactory
– Aggregated intermediate results in tables
– IT support required
– More space!
• «Difficult» to extract data for analysis
• Data growth is not predictable
• DBAs do not guarantee proper backup
when space occupation is over 500Gb
Architectural Elements
• Data ingestion
– SFTP
– Custom Java code
• Data architecture
– Relational DB
• Tools
– SAS, Excel…
– Business Analytics Platform (MicroStrategy)
Business Analyitics Platforms
• Allow to access data stored in different data sources and
represent it in a common multi-dimensional schema that is
easier to query and navigate
– Automatically aggregate measures at different levels of
dimentionality
• Normally used in enterprise contexts to facilitate management
of large-heterogeneous data warehouses
• Enable interactive analysis
– Reports: results of queries in a tabular format that can be
browsed and downloaded in Excel or CSV file
– Dashboard: free navigation on data through the creation of
interactive visualizations in drag-and-drop mode
• First example of use in Istat with large data sizes
Load Pre-process
IT Architecture: Testing Phase
SFTP
Views
Reports andVisualizations
DB
SAS
Microstrategy
Control Dashboard
Load Pre-process
Data Ingestion
SFTP SAS
Control Dashboard
Data is sent by Nielsen in form of compressed text files via SFTP (secure
channel)
The SFTP server is located in Istat data center and it is protected by strict
security policies
Load Pre-process
Data Ingestion
SFTP SAS
Control Dashboard
Received data are handled by programs written in Java
- Load: performs integrity checks on received files, loads data in the DB,
logs received files and estimates discounts
- Pre-process: performs quality checks at record level, discards dirty data
The whole acquisition process is controlled by a web dashboard
Load Pre-process
Data Access and Analysis
SFTP
Views
Reports andVisualizations
DB
SAS
Microstrategy
Data can be accessed in two ways:
• Extraction from the DB
• Materialized views were created in order to facilitate import in SAS
• Use of a business analytics tool (MicroStrategy) for reporting,
visualization and browsing of the data
Load Pre-process
Data Access and Analysis
SFTP
Views
Reports andVisualizations
DB
SAS
Microstrategy
In both access modes the results of common interrogations were pre-computed at
different levels of aggregations and were provided as views or reports
This allowed to speed-up access time at the cost of additional space on the DB
Preliminary Analysis
• Analysis of quality of data
– Anomalies
– Distributions
– Nulls
– …
• Exploratory analysis
– Time series of turnover and quantity
– Distribution per coicop group
– …
Turnover per Market
• Total quantity per week and province
Navigation of DB
Production Data Platform
• We are setting up a new data platform for the
production architecture based on Big Data tools
– 7 nodes Hadoop Cluster
– Hadoop: parallel storage and processing platform, de-facto
standard for Big Data
• Features:
– All historical data always online for interactive analysis
– Possibility of retaining historical data indefinitely
– Costruction of a global historical data warehouse of prices
data
– Database is used only for processing current, online data
(computation of indexes)
– Easier to perform large-scale analysis
Ingestion
IT Architecture: Production Phase
SFTP
Reports and
Visualizations
Control Dashboard SAS
Processing
of indexes
Oracle DB Hadoop
Extraction for
offline analysis
Enhanced data warehouse
Sample
selection
Current data Historical data
Conclusions
• The scanner data project represents a
challenging testbed for experimenting new
approaches in the IT support to analysis and
production
• Objective is get faster results and more
efficient processes
• The concept of «Big Data» is not merely a
matter of size but rather of new opportunities
• Technology can give the answers, now it’s
time to make new questions
Questions

More Related Content

What's hot

Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Dataconomy Media
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil GamesRob Winters
 
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...Dataconomy Media
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsDenodo
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaData Con LA
 
Data mining tools used in business intelligence
Data mining tools used in business intelligenceData mining tools used in business intelligence
Data mining tools used in business intelligenceNithya Ravi
 
Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingDatabricks
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyInfiniteGraph
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Dataconomy Media
 
RWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use CaseRWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use CaseDatabricks
 
Building a Distributed Collaborative Data Pipeline with Apache Spark
Building a Distributed Collaborative Data Pipeline with Apache SparkBuilding a Distributed Collaborative Data Pipeline with Apache Spark
Building a Distributed Collaborative Data Pipeline with Apache SparkDatabricks
 
Business Insight
Business InsightBusiness Insight
Business InsightMicrosoft
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architectureSudheer Kondla
 
Why IT Should Consider Agile Modern Data Delivery Platform
Why IT Should Consider Agile Modern Data Delivery PlatformWhy IT Should Consider Agile Modern Data Delivery Platform
Why IT Should Consider Agile Modern Data Delivery Platformsyed_javed
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin RobbinsData Con LA
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Denodo
 
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...DataWorks Summit
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief OverviewHal Kalechofsky
 

What's hot (20)

Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ...
 
Tableau @ Spil Games
Tableau @ Spil GamesTableau @ Spil Games
Tableau @ Spil Games
 
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...
Sudhir Rawat, Sr Techonology Evangelist at Microsoft SQL Business Intelligenc...
 
Self-Service Analytics with Guard Rails
Self-Service Analytics with Guard RailsSelf-Service Analytics with Guard Rails
Self-Service Analytics with Guard Rails
 
How OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman MaryaHow OpenTable uses Big Data to impact growth by Raman Marya
How OpenTable uses Big Data to impact growth by Raman Marya
 
Data mining tools used in business intelligence
Data mining tools used in business intelligenceData mining tools used in business intelligence
Data mining tools used in business intelligence
 
Empowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark StreamingEmpowering Real Time Patient Care Through Spark Streaming
Empowering Real Time Patient Care Through Spark Streaming
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 
RWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use CaseRWE & Patient Analytics Leveraging Databricks – A Use Case
RWE & Patient Analytics Leveraging Databricks – A Use Case
 
Building a Distributed Collaborative Data Pipeline with Apache Spark
Building a Distributed Collaborative Data Pipeline with Apache SparkBuilding a Distributed Collaborative Data Pipeline with Apache Spark
Building a Distributed Collaborative Data Pipeline with Apache Spark
 
Business Insight
Business InsightBusiness Insight
Business Insight
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architecture
 
Why IT Should Consider Agile Modern Data Delivery Platform
Why IT Should Consider Agile Modern Data Delivery PlatformWhy IT Should Consider Agile Modern Data Delivery Platform
Why IT Should Consider Agile Modern Data Delivery Platform
 
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
Big Data Fabric for At-Scale Real-Time Analysis by Edwin Robbins
 
The Life of an Internet of Things Electron
The Life of an Internet of Things ElectronThe Life of an Internet of Things Electron
The Life of an Internet of Things Electron
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
 
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief Overview
 

Similar to IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat - Antonio Virgillito

Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...Alistair Hamilton
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0Amr Kamel Deklel
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...Big Data Value Association
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Group
 
Business analytics and data visualisation
Business analytics and data visualisationBusiness analytics and data visualisation
Business analytics and data visualisationShwetabh Jaiswal
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdferamfatima43
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Pentaho
 
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...Matt Stubbs
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructureSimon Belak
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 

Similar to IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat - Antonio Virgillito (20)

Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...Experimental transformation of  ABS data into Data Cube Vocabulary (DCV) form...
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form...
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Big data analytics and machine intelligence v5.0
Big data analytics and machine intelligence   v5.0Big data analytics and machine intelligence   v5.0
Big data analytics and machine intelligence v5.0
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Business analytics and data visualisation
Business analytics and data visualisationBusiness analytics and data visualisation
Business analytics and data visualisation
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
Microstrategy Overview
Microstrategy OverviewMicrostrategy Overview
Microstrategy Overview
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...
Big Data LDN 2017: How Big Data Insights Become Easily Accessible With Workfl...
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructure
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 

More from Istituto nazionale di statistica

More from Istituto nazionale di statistica (20)

Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profitCensimenti Permanenti Istituzioni non profit
Censimenti Permanenti Istituzioni non profit
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
Censimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni PubblicheCensimento Permanente Istituzioni Pubbliche
Censimento Permanente Istituzioni Pubbliche
 
14a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica1414a Conferenza Nazionale di Statisticacnstatistica14
14a Conferenza Nazionale di Statisticacnstatistica14
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 
14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica14a Conferenza Nazionale di Statistica
14a Conferenza Nazionale di Statistica
 

Recently uploaded

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseAnaAcapella
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxcallscotland1987
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 

Recently uploaded (20)

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 

IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat - Antonio Virgillito

  • 1. IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat Gianluca D’Amato, Annunziata Fiore, Domenico Infante, Antonella Simone, Giorgio Vinci, Antonino Virgillito Istat Scanner Data Workshop Rome 2 October 2015
  • 2. About me • Head of unit «Architectures for business intelligence, mobile and big data» • Project manager of the UNECE project Big Data in Official Statistics • Leader of the IT group in the Istat Scanner Data project
  • 3. Abstract • In this talk we present the issues and challenges related to dealing with datasets of big size such as those involved in the Scanner Data project at Istat • We illustrate the IT architecture backing the testing phase of the project, currently in place, and the ideas for the production architecture • The motivations behind the design are explained as well as the solutions introduced as part of a larger scope approach to the modernisation of tools and techniques used for data storage and processing in Istat, envisioning the future challenges posed by the adoption of Big Data and Data Science in NSIs
  • 4. Data Size - Current Received and stored data from 6 chains in 6 provinces 30% of all stores available from Nielsen in these provinces 2 and a half years time span Space occupation of 1 year of microdata in the database 426 Million records 18Gb 550 Stores 210,000 Products
  • 5. Data Size - Expected Further 29 provinces will be received by the end of the year Estimated size of 1 year of microdata 47Gb 142Gb 6 chains only All available chains
  • 6. Actual occupation 2 years of microdata only 36Gb Current occupation of DB space 200Gb 100Gb now End 2015 29 provinces 6 chains ? Indexes, views, aggregations, classifications… 540Gb2016? EVERYTHING!
  • 7. The Problem with Size… • Query time is not satisfactory – Aggregated intermediate results in tables – IT support required – More space! • «Difficult» to extract data for analysis • Data growth is not predictable • DBAs do not guarantee proper backup when space occupation is over 500Gb
  • 8. Architectural Elements • Data ingestion – SFTP – Custom Java code • Data architecture – Relational DB • Tools – SAS, Excel… – Business Analytics Platform (MicroStrategy)
  • 9. Business Analyitics Platforms • Allow to access data stored in different data sources and represent it in a common multi-dimensional schema that is easier to query and navigate – Automatically aggregate measures at different levels of dimentionality • Normally used in enterprise contexts to facilitate management of large-heterogeneous data warehouses • Enable interactive analysis – Reports: results of queries in a tabular format that can be browsed and downloaded in Excel or CSV file – Dashboard: free navigation on data through the creation of interactive visualizations in drag-and-drop mode • First example of use in Istat with large data sizes
  • 10. Load Pre-process IT Architecture: Testing Phase SFTP Views Reports andVisualizations DB SAS Microstrategy Control Dashboard
  • 11. Load Pre-process Data Ingestion SFTP SAS Control Dashboard Data is sent by Nielsen in form of compressed text files via SFTP (secure channel) The SFTP server is located in Istat data center and it is protected by strict security policies
  • 12. Load Pre-process Data Ingestion SFTP SAS Control Dashboard Received data are handled by programs written in Java - Load: performs integrity checks on received files, loads data in the DB, logs received files and estimates discounts - Pre-process: performs quality checks at record level, discards dirty data The whole acquisition process is controlled by a web dashboard
  • 13. Load Pre-process Data Access and Analysis SFTP Views Reports andVisualizations DB SAS Microstrategy Data can be accessed in two ways: • Extraction from the DB • Materialized views were created in order to facilitate import in SAS • Use of a business analytics tool (MicroStrategy) for reporting, visualization and browsing of the data
  • 14. Load Pre-process Data Access and Analysis SFTP Views Reports andVisualizations DB SAS Microstrategy In both access modes the results of common interrogations were pre-computed at different levels of aggregations and were provided as views or reports This allowed to speed-up access time at the cost of additional space on the DB
  • 15. Preliminary Analysis • Analysis of quality of data – Anomalies – Distributions – Nulls – … • Exploratory analysis – Time series of turnover and quantity – Distribution per coicop group – …
  • 17. • Total quantity per week and province
  • 19. Production Data Platform • We are setting up a new data platform for the production architecture based on Big Data tools – 7 nodes Hadoop Cluster – Hadoop: parallel storage and processing platform, de-facto standard for Big Data • Features: – All historical data always online for interactive analysis – Possibility of retaining historical data indefinitely – Costruction of a global historical data warehouse of prices data – Database is used only for processing current, online data (computation of indexes) – Easier to perform large-scale analysis
  • 20. Ingestion IT Architecture: Production Phase SFTP Reports and Visualizations Control Dashboard SAS Processing of indexes Oracle DB Hadoop Extraction for offline analysis Enhanced data warehouse Sample selection Current data Historical data
  • 21. Conclusions • The scanner data project represents a challenging testbed for experimenting new approaches in the IT support to analysis and production • Objective is get faster results and more efficient processes • The concept of «Big Data» is not merely a matter of size but rather of new opportunities • Technology can give the answers, now it’s time to make new questions