TEXTUAL INFORMATION ANALYSIS
FOR THE INTEGRATION
OF DIFFERENT DATA REPOSITORIES
Ministero dell’Economia e delle Finanze
Dipartimento per le Politiche di Sviluppo
UVER – Unità di verifica degli investimenti pubblici
Carlo Amati
THE DEPARTMENT FOR DEVELOPMENT POLICY
UVER: PUBLIC INVESTMENT VERIFICATION UNIT
Project Verification
Investment analysis
Assessment of results
On-site inspections
Monitoring databases
Assistance and support
Forecasting models
Effectiveness Appraisal
Monitoring And Statistics
PUBLIC INVESTMENT MONITORING DATABASES…
APQ (Regional policy)
MONIT (EU funds)
COMP (Pending works)
…
AVLP
Calls for tender
…
… WERE BORN WITH DIFFERENT PURPOSES…
Progress monitoring (duration and expenditure)
Financial audit
Compliance with regulation
Notification to contractors
… AND INFORMATION IS EXTREMELY HETEROGENEOUS
Project | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 | Variable 7 | Variable 8 | Variable 9
Project 1 | DB 1 2 3 | DB 1 3 | DB 1 2 | DB 2 3 | DB 2 | DB 1 2 3 | DB 1 2 3 | DB 3 | DB 1
Project 2 | DB 1 3 | DB 1 3 | DB 1 | DB 3 | DB | DB 1 3 | DB 1 3 | DB 3 | DB 1
Project 3 | DB 1 2 | DB 1 | DB 1 2 | DB 2 | DB 2 | DB 1 2 | DB 1 2 | DB 3 | DB 1
Project 4 | DB 1 3 | DB 1 3 | DB 1 | DB 3 | DB | DB 1 3 | DB 1 3 | DB 3 | DB 1
Project 5 | DB 2 3 | DB 3 | DB 2 | DB 2 3 | DB 2 | DB 2 3 | DB 2 3 | DB 3 | DB
Project 6 | DB 1 2 3 | DB 1 3 | DB 1 2 | DB 2 3 | DB 2 | DB 1 2 3 | DB 1 2 3 | DB 3 | DB 1
Project 7 | DB 1 3 | DB 1 3 | DB 1 | DB 3 | DB | DB 1 3 | DB 1 3 | DB 3 | DB 1
Project 8 | DB 1 2 | DB 1 | DB 1 2 | DB 2 | DB 2 | DB 1 2 | DB 1 2 | DB | DB 1
Project 9 | DB 1 2 | DB 1 | DB 1 2 | DB 2 | DB 2 | DB 1 2 | DB 1 2 | DB | DB 1
Project 10 | DB 1 2 | DB 1 | DB 1 2 | DB 2 | DB 2 | DB 1 2 | DB 1 2 | DB | DB 1
Exact data
Incoherent data
Wrong data
Missing data
OUR GOAL
ASSEMBLE ALL THE AVAILABLE INFORMATION
INTO A SINGLE FRAMEWORK
FOR THE ANALYSIS OF PUBLIC INVESTMENTS
TOWARDS INTEGRATION
Data integration (matching at micro-level)
Informational approach
Normative approach
MIP–CUP (new primary key)
NORMATIVE APPROACH
MIP: Monitoring system of public investment (Monitoraggio
Investimenti Pubblici), established by the Interministerial
Committee for Economic Planning in order to produce timely
information on the implementation of development policy
(L.144/99).
CUP: Unique project code (Codice unico di progetto), the project primary
key, required for each new or ongoing project as of 1 January 2003 (L.3/03).
The CUP must be quoted in every administrative and accounting document,
both paper and digital, regarding a public investment project and
must be reported in every database related to such projects
(Reg. 24/04).
INFORMATIONAL APPROACH
Recognition of information related to the same projects in different
data repositories: each repository usually represents the same item
in its own format, so it is virtually impossible to find a common
variable across repositories and create an automatic join
between information on the same project.
Integration of the related information: as the relevant information
on a project is dispersed across several databases, rules
must be defined in order to merge it all into a single repository.
MAIN DATA REPOSITORIES ON PUBLIC INVESTMENTS
APQ
MONIT
COMP
PROJECTS
TOTAL VALUE BN€
AVG. EXPENDITURE PY BN€
YEAR RANGE 2000-2004
CALLS FOR
TENDER
PROJECTS
TOTAL VALUE BN€
AVG. EXPENDITURE PY BN€
YEAR RANGE 1998-2004
AVLP
MEF MONITORING SYSTEMS
PROJECTS
TOTAL VALUE BN€
AVG. EXPENDITURE PY BN€
YEAR RANGE 2000-2004
PROJECTS
TOTAL VALUE BN€
AVG. EXPENDITURE PY BN€
YEAR RANGE 2002-2004
AVG. EXPENDITURE PY
2000-2004
BN€ 31.5
(CPT-IA)
PROJECTS
TOTAL VALUE BN€
AVG. EXPENDITURE PY BN€
YEAR RANGE 1998-2004
10,500
56.5
2.3
6.5
320
2.9
0.3
94,000
71.0
12.3-15.2
532,000
312.0
30.1-38.8
A GLIMPSE OF DATA ON PUBLIC INVESTMENTS
AVLP
CALLS FOR TENDER
OUR TOOLS
X445
Version 9.1.3
Client-server
architecture
QUALITATIVE DATA: Description, Other textual information
QUANTITATIVE DATA: Amounts, Times, Location
Representation of each project in an N-dimensional space
PROJECT MATCHING
THE MATCHING PROCESS
Define a control sample with known matches/no matches
Define a best strategy to retrieve the known matches
Define an optimal stratification for potential matches
Apply the process to stratified test data
1 – THE CONTROL SAMPLE
Find a common variable with the fewest repeated
values and treat the values with a single occurrence
in each database as keys for a 1-to-1 matching.
50 billion potential matches
between records of the two
repositories (AVLP-CFT)
A good example of such a variable is the
cost of the safety plan.
The dataset built on single occurrences of the safety-plan
cost contains nearly 5,000 potential matches. 74% of them
were inspected visually and classified as MATCH (47%) or
NO_MATCH (27%); these form the control sample.
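The key-selection step above can be sketched as follows. This is a minimal illustration, not the procedure actually run on AVLP and CFT: the record layout, the field name `safety_cost` and the toy values are all hypothetical.

```python
from collections import Counter

def unique_value_pairs(avlp, cft, key):
    """Pair records whose `key` value occurs exactly once in each dataset."""
    avlp_counts = Counter(rec[key] for rec in avlp)
    cft_counts = Counter(rec[key] for rec in cft)
    # Index the CFT records whose key value is unique within CFT.
    cft_unique = {rec[key]: rec for rec in cft if cft_counts[rec[key]] == 1}
    pairs = []
    for rec in avlp:
        value = rec[key]
        # Keep the pair only if the value is also unique within AVLP.
        if avlp_counts[value] == 1 and value in cft_unique:
            pairs.append((rec, cft_unique[value]))
    return pairs

# Hypothetical toy records, not taken from AVLP or CFT.
avlp = [
    {"id": "A1", "safety_cost": 1250.0},
    {"id": "A2", "safety_cost": 980.5},
    {"id": "A3", "safety_cost": 980.5},   # repeated value: not usable as a key
    {"id": "A4", "safety_cost": 4310.0},  # no counterpart in CFT
]
cft = [
    {"id": "C1", "safety_cost": 1250.0},
    {"id": "C2", "safety_cost": 980.5},
    {"id": "C3", "safety_cost": 7700.0},
]

candidates = unique_value_pairs(avlp, cft, "safety_cost")
print([(a["id"], c["id"]) for a, c in candidates])
```

The pairs returned here are only potential matches; as in the deck, each one still has to be inspected and labelled MATCH or NO_MATCH before it enters the control sample.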
2 – THE RETRIEVAL STRATEGY
Define a set of rules in order to classify matches on
each variable
Define a combination method in order to synthesize the
variable matches
Define a matching rule in order to separate out good
candidates
The retrieval strategy is tested against the control sample.
2.1 – THE CLASSIFICATION RULES
Quantitative variables → crisp rule based on a cut-off value
for the difference between the same variable in different
databases (binary matches)
Qualitative variables (textual information) → transformation
into quantitative variables by means of:
- Text analysis functions
- Text mining algorithms
Then treatment as quantitative variables.
QUANTITATIVE VARIABLES
x_i = Base bid, winning percentage, year of procurement,
implementing body (*)
Matching rule:
\[
\mathrm{MATCH}_i =
\begin{cases}
1 & \text{if } \operatorname{abs}\left(x_i^{AVLP} - x_i^{CFT}\right) \le d \\
0 & \text{if } \operatorname{abs}\left(x_i^{AVLP} - x_i^{CFT}\right) > d
\end{cases}
\]
Default cut-off value d is zero.
(*) Implementing bodies match if their strings are identical.
Safety-plan cost is discarded as it is used to build the control sample.
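A minimal sketch of the crisp matching rule above, assuming each project is a dictionary of variables. The field names, the toy values and the `cutoffs` argument are hypothetical; only the rule itself (absolute difference against a cut-off d, default zero, and string identity for implementing bodies) comes from the slide.

```python
def match_flag(x_avlp, x_cft, d=0.0):
    """Return 1 if abs(x_avlp - x_cft) <= d, otherwise 0."""
    return 1 if abs(x_avlp - x_cft) <= d else 0

def body_flag(body_avlp, body_cft):
    """Implementing bodies match only if their strings are identical."""
    return 1 if body_avlp == body_cft else 0

def quantitative_flags(avlp_rec, cft_rec, cutoffs=None):
    """Compute one binary flag per variable for a candidate pair."""
    cutoffs = cutoffs or {}
    flags = {}
    for name in ("base_bid", "winning_pct", "year"):
        d = cutoffs.get(name, 0.0)  # default cut-off value is zero
        flags[name] = match_flag(avlp_rec[name], cft_rec[name], d)
    flags["body"] = body_flag(avlp_rec["body"], cft_rec["body"])
    return flags

# Hypothetical candidate pair.
avlp_rec = {"base_bid": 250000.0, "winning_pct": 12.5, "year": 2004,
            "body": "COMUNE DI PERUGIA"}
cft_rec = {"base_bid": 250000.0, "winning_pct": 12.4, "year": 2004,
           "body": "Comune di Perugia"}

flags = quantitative_flags(avlp_rec, cft_rec)
print(flags)
```

With the default cut-off of zero the winning percentages do not match, and the implementing bodies fail on letter case alone, which illustrates why a strict rule yields few correct matches on noisy variables.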
CONTROL SUBSAMPLE AND
QUANTITATIVE VARIABLES MATCHES
Variable | Correct matches in control subsample | % correct matches in control subsample | Extra matches in control subsample | % extra matches in control subsample
Base bid | 7 | 16.7% | 0 | 0.0%
Winning percentage | 28 | 66.7% | 44 | 2.0%
Year of procurement | 39 | 92.9% | 707 | 32.9%
Implementing body | 40 | 95.2% | 96 | 4.5%
For the subsequent textual analysis it is easier to use a subset of the control sample.
51 candidate matches for Municipalities in Umbria:
- 42 matches
- 1 no match
- 8 no match (Umbria AVLP misclassified)
The Cartesian product of all possible combinations has 51 x 43 = 2193 records.
QUALITATIVE VARIABLE: PROJECT DESCRIPTION
TEXT MINING PROCESS
Project descriptions from both sets are appended into a
single dataset which is fed into the Text Miner node.
TEXT MINING – DIMENSION REDUCTION
Roll-up terms
The initial number of roll-up terms is set equal to the
number of terms appearing in more than one title. All
other terms are dropped.
Singular value decomposition
The number of dimensions given by the roll-up terms is
reduced. The new space has a lower dimension and can
be handled more easily.
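The two reduction steps can be illustrated with a small sketch. It is a stand-in written with NumPy, not the Text Miner node used in the deck: whitespace tokenisation, raw term counts, the choice k = 2 and the example titles are all assumptions made for illustration.

```python
import numpy as np

def rollup_terms(titles):
    """Return the terms that appear in more than one title."""
    doc_freq = {}
    for title in titles:
        for term in set(title.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, n in doc_freq.items() if n > 1)

def term_matrix(titles, terms):
    """Build a titles x terms count matrix restricted to the kept terms."""
    column = {term: j for j, term in enumerate(terms)}
    matrix = np.zeros((len(titles), len(terms)))
    for i, title in enumerate(titles):
        for term in title.lower().split():
            if term in column:
                matrix[i, column[term]] += 1.0
    return matrix

def svd_reduce(matrix, k):
    """Project each row onto the first k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Hypothetical titles, appended from both repositories into one list.
titles = [
    "road maintenance via roma",
    "maintenance of road via roma",
    "school roof repair",
    "repair of school roof",
    "new sports hall",
]

terms = rollup_terms(titles)            # terms found in a single title are dropped
coords = svd_reduce(term_matrix(titles, terms), k=2)
print(terms)
print(coords.shape)
```

Each title ends up as a point in a k-dimensional space, so titles describing the same works land close together even when their wording differs slightly.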
TEXT MINING – SVD DISTANCE
The distance in the new space is
computed for all pairs of
projects from the two repositories.
The optimal cut-off value that
discriminates matches from no-matches
is obtained with a binary
tree.
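A one-split tree on a single variable amounts to choosing the threshold that best separates the two classes. The sketch below assumes that reading and minimises the number of misclassified pairs; the splitting criterion actually used in the deck is not stated, and the distances and labels are hypothetical.

```python
def best_cutoff(distances, is_match):
    """Find the threshold t minimising errors of the rule 'match if distance <= t'."""
    pairs = list(zip(distances, is_match))
    values = sorted(set(distances))
    # Candidate thresholds: midpoints between consecutive distinct distances.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_err = None, None
    for t in candidates:
        errors = sum((d <= t) != m for d, m in pairs)
        if best_err is None or errors < best_err:
            best_t, best_err = t, errors
    return best_t, best_err

# Hypothetical SVD distances for labelled pairs from a control sample.
distances = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.80, 0.90]
is_match = [True, True, True, False, True, True, False, False, False]

cutoff, errors = best_cutoff(distances, is_match)
print(cutoff, errors)
```

Pairs at or below the returned cut-off are flagged as text matches, which turns the continuous SVD distance into the same kind of binary flag used for the quantitative variables.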
MATCHES IN THE CONTROL SUBSAMPLE
The best result is that of the SVD on targeted roll-up terms. It can
be improved further by also counting as a match the minimum SVD
distance for each project (42 correct matches, 11 extra matches).
Difference measure | Correct matches in control subsample | % correct matches in control subsample | Extra matches in control subsample | % extra matches in control subsample
Raw SVD | 22 | 52.4% | 1 | 0.0%
Raw Roll-up | 31 | 73.8% | 5 | 0.2%
Spedis | 34 | 81.0% | 3 | 0.1%
Compged | 28 | 66.7% | 2 | 0.1%
Complev | 26 | 61.9% | 6 | 0.3%
Targeted Roll-up | 37 | 88.1% | 5 | 0.2%
SVD on targeted roll-up | 39 | 92.9% | 6 | 0.3%
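Complev in the table is presumably the SAS function of that name, which returns the Levenshtein edit distance between two strings. A plain implementation of that distance is sketched below for readers without SAS; Spedis and Compged use their own cost schemes and are not reproduced here. The example descriptions are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical project descriptions differing by one letter.
print(levenshtein("manutenzione strada", "manutenzione strade"))
```

As with the SVD distance, a cut-off on the edit distance is needed to turn it into a binary match flag.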
2.2 – THE COMBINATION METHOD
2.3 – THE DISCRIMINATING RULE
Combination method → Sum of the binary matching flags
for all the variables
Other methods: weighted sum, logistic model
Discriminating rule → Strict match if # matching vars ≥ 3
Loose match if # matching vars ≥ 2
Other rules: conditions on single variables
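The sum-and-threshold rule can be sketched in a few lines. The flag names are hypothetical; the thresholds (at least 3 for a strict match, at least 2 for a loose match) are those stated on the slide.

```python
def classify(flags, strict_min=3, loose_min=2):
    """Sum the binary flags and return the strongest label that applies."""
    score = sum(flags.values())
    if score >= strict_min:
        return "strict"
    if score >= loose_min:
        return "loose"
    return "none"

# Hypothetical flags for one candidate pair (four quantitative variables plus text).
flags = {"base_bid": 0, "winning_pct": 1, "year": 1, "body": 1, "description": 0}
print(classify(flags))
```

The function returns only the strongest label, so every strict match would also count as a loose match when the two totals are reported separately.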
CONTROL SAMPLE RESULTS
[Bar chart: number of correct matches (bar values 1, 15, 21, 5) by number of matching variables (0 to 5), with the loose and strict cut-off values marked]
With textual info the proportion of false positives decreases by up to 58 percentage points.
Number of matching variables with (without) textual info | Correct matches | % correct matches | Extra matches with textual info | Extra matches without textual info | % false positives with textual info | % false positives without textual info
4 (3) | 26 | 61.9% | 0 | 0 | 0.0% | 0.0%
3 (2) | 41 | 97.6% | 1 | 63 | 2.4% | 60.6%
2 (1) | 42 | 100.0% | 67 | 784 | 61.5% | 94.9%
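The false-positive percentages in this table appear to equal extra matches divided by all retrieved pairs (correct plus extra). That reading is inferred from the figures rather than stated on the slide; the short check below reproduces the table under that assumption.

```python
def false_positive_share(correct, extra):
    """Percentage of retrieved pairs that are extra (false) matches."""
    total = correct + extra
    return 0.0 if total == 0 else 100.0 * extra / total

# Rows of the table: label -> (correct, extra with text, extra without text).
rows = {
    "4 (3)": (26, 0, 0),
    "3 (2)": (41, 1, 63),
    "2 (1)": (42, 67, 784),
}

for label, (correct, extra_with, extra_without) in rows.items():
    with_text = round(false_positive_share(correct, extra_with), 1)
    without_text = round(false_positive_share(correct, extra_without), 1)
    print(label, with_text, without_text, round(without_text - with_text, 1))
```

For the strict rule (row "3 (2)") the gap between the two shares is about 58 percentage points, which is the reduction quoted above the table.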
3 – THE OPTIMAL STRATIFICATION
Stratification variables: type of implementing body, region,
year of procurement.
The number of potential matches
grows like n^2.
Municipalities alone, which cover more
than half of the projects, would
lead to 12,200 million potential
matches.
Regional stratification reduces the
total to 968 million (regional
variation in the range 0.4-400
million). Further stratification may
be needed in denser regions
(a compressed dataset of 400 million
records requires nearly 15 GB for a
match on one quantitative
variable).
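The effect of stratification on the number of candidate pairs can be shown with a small count. The sketch assumes that pairs are compared only within the same stratum; the region labels and record counts are a hypothetical toy example, not the deck's figures.

```python
from collections import Counter

def potential_pairs(avlp_strata, cft_strata):
    """Count candidate pairs when records are compared only within a stratum."""
    avlp_counts = Counter(avlp_strata)
    cft_counts = Counter(cft_strata)
    # A missing stratum in CFT counts as zero, so it contributes no pairs.
    return sum(n * cft_counts[stratum] for stratum, n in avlp_counts.items())

# Hypothetical stratum label (region) for each record in the two repositories.
avlp_regions = ["Umbria"] * 3 + ["Marche"] * 2
cft_regions = ["Umbria"] * 4 + ["Marche"] * 5 + ["Lazio"] * 1

unstratified = len(avlp_regions) * len(cft_regions)   # full Cartesian product
stratified = potential_pairs(avlp_regions, cft_regions)
print(unstratified, stratified)
```

Adding further stratification variables, such as year of procurement, splits each stratum again and shrinks the sum of products further, at the cost of losing true matches whose stratification variables are wrong in one repository.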
[Chart: number of matches]
4 – APPLICATION TO STRATIFIED TEST DATA
Type of implementing body: Municipalities
Region: Marche
Year of procurement: 2004
Matches in control sample: 2
AVLP projects: 247
CFT projects: 506
Automatic strict matches: 128 (52%)
Automatic loose matches: 160 (65%)
USAGE OF MATCHING RESULTS
Replacement of missing data
Correction of mistakes
Identification of potential evaders
Data integration
FURTHER DEVELOPMENT
Textual information preprocessing
- Built-in macros
- Exploratory analyses with Text Miner
Use of fuzzy classification rules (intervals and levels of
uncertainty, instead of cut-off values)
Stratification variables error-handling
Tie-breaks
- Multiple records in CFT referring to the same tender
- Multiple tenders in CFT referring to the same project
- Very similar records
Textual information analysis for the integration of different data repositories

More Related Content

Similar to Textual information analysis for the integration of different data repositories

How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patentsMIPLM
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Analysis of data science software 2020
Analysis of data science software 2020Analysis of data science software 2020
Analysis of data science software 2020Russ Reinsch
 
A process to improve the accuracy of mk ii fp to cosmic charles symons
A process to improve the accuracy of mk ii fp to cosmic    charles symonsA process to improve the accuracy of mk ii fp to cosmic    charles symons
A process to improve the accuracy of mk ii fp to cosmic charles symonsIWSM Mensura
 
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...IRJET Journal
 
A new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterA new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterVLSICS Design
 
An Introduction to Video Principles-Part 1
An Introduction to Video Principles-Part 1   An Introduction to Video Principles-Part 1
An Introduction to Video Principles-Part 1 Dr. Mohieddin Moradi
 
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...IRJET Journal
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET Journal
 
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...IAEME Publication
 
Decisions in a supply chain modeling for comparative evaluation strategies in...
Decisions in a supply chain modeling for comparative evaluation strategies in...Decisions in a supply chain modeling for comparative evaluation strategies in...
Decisions in a supply chain modeling for comparative evaluation strategies in...IAEME Publication
 
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Alejandro Salado
 
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...Applying Neural Networks and Analogous Estimating to Determine the Project Bu...
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...Ricardo Viana Vargas
 
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...IRJET Journal
 
Next generation network oss bss market and forecast 2013-2018 - Reports Corner
Next generation network oss bss market and forecast 2013-2018 - Reports CornerNext generation network oss bss market and forecast 2013-2018 - Reports Corner
Next generation network oss bss market and forecast 2013-2018 - Reports CornerReports Corner
 
TSO Reliability Management: a probabilistic approach for better balance betwe...
TSO Reliability Management: a probabilistic approach for better balance betwe...TSO Reliability Management: a probabilistic approach for better balance betwe...
TSO Reliability Management: a probabilistic approach for better balance betwe...Leonardo ENERGY
 
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryForecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryIRJET Journal
 
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryForecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryIRJET Journal
 

Similar to Textual information analysis for the integration of different data repositories (20)

How to valuate and determine standard essential patents
How to valuate and determine standard essential patentsHow to valuate and determine standard essential patents
How to valuate and determine standard essential patents
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Analysis of data science software 2020
Analysis of data science software 2020Analysis of data science software 2020
Analysis of data science software 2020
 
A process to improve the accuracy of mk ii fp to cosmic charles symons
A process to improve the accuracy of mk ii fp to cosmic    charles symonsA process to improve the accuracy of mk ii fp to cosmic    charles symons
A process to improve the accuracy of mk ii fp to cosmic charles symons
 
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...IRJET-  	  Predicting Bitcoin Prices using Convolutional Neural Network Algor...
IRJET- Predicting Bitcoin Prices using Convolutional Neural Network Algor...
 
A new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterA new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converter
 
An Introduction to Video Principles-Part 1
An Introduction to Video Principles-Part 1   An Introduction to Video Principles-Part 1
An Introduction to Video Principles-Part 1
 
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...
Optimizing Data Encoding Technique For Dynamic Power Reduction In Network On ...
 
Costing
CostingCosting
Costing
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...IRJET-  	  Security Efficiency of Transfering the Data for Wireless Sensor Ne...
IRJET- Security Efficiency of Transfering the Data for Wireless Sensor Ne...
 
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...
DECISIONS IN A SUPPLY CHAIN MODELING FOR COMPARATIVE EVALUATION STRATEGIES IN...
 
Decisions in a supply chain modeling for comparative evaluation strategies in...
Decisions in a supply chain modeling for comparative evaluation strategies in...Decisions in a supply chain modeling for comparative evaluation strategies in...
Decisions in a supply chain modeling for comparative evaluation strategies in...
 
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
Assessing the Impacts of Uncertainty Propagation to System Requirements by Ev...
 
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...Applying Neural Networks and Analogous Estimating to Determine the Project Bu...
Applying Neural Networks and Analogous Estimating to Determine the Project Bu...
 
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
IRJET - A Research on Eloquent Salvation and Productive Outsourcing of Massiv...
 
Next generation network oss bss market and forecast 2013-2018 - Reports Corner
Next generation network oss bss market and forecast 2013-2018 - Reports CornerNext generation network oss bss market and forecast 2013-2018 - Reports Corner
Next generation network oss bss market and forecast 2013-2018 - Reports Corner
 
TSO Reliability Management: a probabilistic approach for better balance betwe...
TSO Reliability Management: a probabilistic approach for better balance betwe...TSO Reliability Management: a probabilistic approach for better balance betwe...
TSO Reliability Management: a probabilistic approach for better balance betwe...
 
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryForecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
 
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction IndustryForecasting and Rate Analysis of Cost Escalation for Construction Industry
Forecasting and Rate Analysis of Cost Escalation for Construction Industry
 

More from carloamati

What happens in territories? Explore how your local administration is performing
What happens in territories? Explore how your local administration is performingWhat happens in territories? Explore how your local administration is performing
What happens in territories? Explore how your local administration is performingcarloamati
 
DPS and CPT eXplorer: connecting data & policy
DPS and CPT eXplorer: connecting data & policyDPS and CPT eXplorer: connecting data & policy
DPS and CPT eXplorer: connecting data & policycarloamati
 
eXplorer: a dynamic statistical data visualization tool to support Italian de...
eXplorer: a dynamic statistical data visualization tool to support Italian de...eXplorer: a dynamic statistical data visualization tool to support Italian de...
eXplorer: a dynamic statistical data visualization tool to support Italian de...carloamati
 
DPS eXplorer: a tool for transparency on public expenditure
DPS eXplorer: a tool for transparency on public expenditureDPS eXplorer: a tool for transparency on public expenditure
DPS eXplorer: a tool for transparency on public expenditurecarloamati
 
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...Capacity building via OpenCoesione, the Italian open strategy on cohesion po...
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...carloamati
 
VISTO: An operational tool to visualise the duration of public investment pro...
VISTO: An operational tool to visualise the duration of public investment pro...VISTO: An operational tool to visualise the duration of public investment pro...
VISTO: An operational tool to visualise the duration of public investment pro...carloamati
 

More from carloamati (6)

What happens in territories? Explore how your local administration is performing
What happens in territories? Explore how your local administration is performingWhat happens in territories? Explore how your local administration is performing
What happens in territories? Explore how your local administration is performing
 
DPS and CPT eXplorer: connecting data & policy
DPS and CPT eXplorer: connecting data & policyDPS and CPT eXplorer: connecting data & policy
DPS and CPT eXplorer: connecting data & policy
 
eXplorer: a dynamic statistical data visualization tool to support Italian de...
eXplorer: a dynamic statistical data visualization tool to support Italian de...eXplorer: a dynamic statistical data visualization tool to support Italian de...
eXplorer: a dynamic statistical data visualization tool to support Italian de...
 
DPS eXplorer: a tool for transparency on public expenditure
DPS eXplorer: a tool for transparency on public expenditureDPS eXplorer: a tool for transparency on public expenditure
DPS eXplorer: a tool for transparency on public expenditure
 
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...Capacity building via OpenCoesione, the Italian open strategy on cohesion po...
Capacity building via OpenCoesione, the Italian open strategy on cohesion po...
 
VISTO: An operational tool to visualise the duration of public investment pro...
VISTO: An operational tool to visualise the duration of public investment pro...VISTO: An operational tool to visualise the duration of public investment pro...
VISTO: An operational tool to visualise the duration of public investment pro...
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Textual information analysis for the integration of different data repositories

  • 1. TEXTUAL INFORMATION ANALYSIS FOR THE INTEGRATION OF DIFFERENT DATA REPOSITORIES Ministero dell’Economia e delle Finanze Dipartimento per le Politiche di Sviluppo UVER – Unità di verifica degli investimenti pubblici Carlo Amati
  • 2. THE DEPARTMENT FOR DEVELOPMENT POLICY
  • 3. UVER: PUBLIC INVESTMENT VERIFICATION UNIT Project Verification; Investment analysis; Assessment of results; On-site inspections; Monitoring databases; Assistance and support; Forecasting models; Effectiveness Appraisal; Monitoring and Statistics
  • 4. PUBLIC INVESTMENT MONITORING DATABASES… APQ (Regional policy); MONIT (EU funds); COMP (Pending works); …; AVLP; Calls for tender; …
  • 5. … WERE BORN WITH DIFFERENT PURPOSES… Progress monitoring (duration and expenditure); Financial audit; Compliance with regulation; Notification to contractors
  • 6. … AND INFORMATION IS EXTREMELY HETEROGENEOUS
                 | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 | Variable 7 | Variable 8 | Variable 9
      Project 1  | DB 1 2 3   | DB 1 3     | DB 1 2     | DB 2 3     | DB 2       | DB 1 2 3   | DB 1 2 3   | DB 3       | DB 1
      Project 2  | DB 1 3     | DB 1 3     | DB 1       | DB 3       | DB         | DB 1 3     | DB 1 3     | DB 3       | DB 1
      Project 3  | DB 1 2     | DB 1       | DB 1 2     | DB 2       | DB 2       | DB 1 2     | DB 1 2     | DB 3       | DB 1
      Project 4  | DB 1 3     | DB 1 3     | DB 1       | DB 3       | DB         | DB 1 3     | DB 1 3     | DB 3       | DB 1
      Project 5  | DB 2 3     | DB 3       | DB 2       | DB 2 3     | DB 2       | DB 2 3     | DB 2 3     | DB 3       | DB
      Project 6  | DB 1 2 3   | DB 1 3     | DB 1 2     | DB 2 3     | DB 2       | DB 1 2 3   | DB 1 2 3   | DB 3       | DB 1
      Project 7  | DB 1 3     | DB 1 3     | DB 1       | DB 3       | DB         | DB 1 3     | DB 1 3     | DB 3       | DB 1
      Project 8  | DB 1 2     | DB 1       | DB 1 2     | DB 2       | DB 2       | DB 1 2     | DB 1 2     | DB         | DB 1
      Project 9  | DB 1 2     | DB 1       | DB 1 2     | DB 2       | DB 2       | DB 1 2     | DB 1 2     | DB         | DB 1
      Project 10 | DB 1 2     | DB 1       | DB 1 2     | DB 2       | DB 2       | DB 1 2     | DB 1 2     | DB         | DB 1
      Legend: Exact data; Incoherent data; Wrong data; Missing data
  • 7. OUR GOAL ASSEMBLE ALL THE AVAILABLE INFORMATION INTO A UNIQUE FRAMEWORK FOR THE ANALYSIS OF PUBLIC INVESTMENTS
  • 8. TOWARDS INTEGRATION Data integration (matching at micro-level) Informational approach Normative approach MIP–CUP (new primary key)
  • 9. NORMATIVE APPROACH MIP: Monitoring system of public investment (Monitoraggio Investimenti Pubblici), established by the Interministerial Committee for Economic Planning in order to produce timely information on the implementation of development policy (L.144/99). CUP: Project primary key (Codice unico di progetto), required for each new or on-going project as of 1st Jan 2003 (L.3/03). Must be quoted in every administrative and accounting document, both paper and digital, regarding a public investment project and must be reported in every database related to the above projects (Reg. 24/04).
  • 10. INFORMATIONAL APPROACH Recognition of information related to same projects in different data repositories: each repository usually represents the same item in a specific format so that it is virtually unfeasible to find a common variable across different repositories and create an automatic join between information on the same project. Integration of the related information: as the relevant information on a project is dispersed across several databases, some rules must be defined in order to merge it all into a single repository.
  • 11. MAIN DATA REPOSITORIES ON PUBLIC INVESTMENTS APQ MONIT COMP PROJECTS TOTAL VALUE BN€ AVG. EXPENDITURE PY BN€ YEAR RANGE 2000-2004 CALLS FOR TENDER PROJECTS TOTAL VALUE BN€ AVG. EXPENDITURE PY BN€ YEAR RANGE 1998-2004 AVLP MEF MONITORING SYSTEMS PROJECTS TOTAL VALUE BN€ AVG. EXPENDITURE PY BN€ YEAR RANGE 2000-2004 PROJECTS TOTAL VALUE BN€ AVG. EXPENDITURE PY BN€ YEAR RANGE 2002-2004 AVG. EXPENDITURE PY 2000-2004 BN€ 31.5 (CPT-IA) PROJECTS TOTAL VALUE BN€ AVG. EXPENDITURE PY BN€ YEAR RANGE 1998-2004 10,500 56.5 2.3 6.5 320 2.9 0.3 94,000 71.0 12.3-15.2 532,000 312.0 30.1-38.8
  • 12. A GLIMPSE OF DATA ON PUBLIC INVESTMENTS AVLP CALLS FOR TENDER
  • 14. PROJECT MATCHING
      Qualitative data: description, other textual information
      Quantitative data: amounts, times, location
      Representation of each project in an N-dimensional space
  • 15. THE MATCHING PROCESS
      1. Define a control sample with known matches/no matches
      2. Define the best strategy to retrieve the known matches
      3. Define an optimal stratification for potential matches
      4. Apply the process to stratified test data
  • 16. 1 – THE CONTROL SAMPLE Find a common variable with the fewest repeated values and treat the values that occur only once in each database as keys for a 1-to-1 matching. There are 50 billion potential matches between records of the two repositories (AVLP-CFT). A good example of such a variable is the cost of the safety plan. The dataset built on single occurrences of the safety-plan cost contains nearly 5,000 potential matches. 74% of them are visually inspected and classified as MATCH (47%) or NO_MATCH (27%); these form the control sample.
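
The key-building step on this slide can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes each repository is a list of records with hypothetical fields `id` and `safety_cost`, and it keeps only the values that occur exactly once in each repository.

```python
from collections import Counter

def unique_value_pairs(avlp, cft, key="safety_cost"):
    """Pair records whose `key` value occurs exactly once in each repository.

    Field names are hypothetical; the resulting pairs are only candidates
    and would still need to be inspected and labelled MATCH / NO_MATCH.
    """
    avlp_counts = Counter(r[key] for r in avlp)
    cft_counts = Counter(r[key] for r in cft)
    # Index only the records whose value is unique within their own repository.
    avlp_unique = {r[key]: r for r in avlp if avlp_counts[r[key]] == 1}
    cft_unique = {r[key]: r for r in cft if cft_counts[r[key]] == 1}
    shared = sorted(avlp_unique.keys() & cft_unique.keys())
    return [(avlp_unique[v]["id"], cft_unique[v]["id"]) for v in shared]

# Toy data: 300.0 is repeated in AVLP, 980.0 appears only in CFT.
avlp = [
    {"id": "A1", "safety_cost": 1250.0},
    {"id": "A2", "safety_cost": 300.0},
    {"id": "A3", "safety_cost": 300.0},
]
cft = [
    {"id": "C1", "safety_cost": 1250.0},
    {"id": "C2", "safety_cost": 300.0},
    {"id": "C3", "safety_cost": 980.0},
]
pairs = unique_value_pairs(avlp, cft)
print(pairs)
```

Only the value 1250.0 is unique on both sides, so a single candidate pair is produced; repeated values are excluded because they cannot act as a 1-to-1 key.
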
  • 17. 2 – THE RETRIEVAL STRATEGY
      - Define a set of rules in order to classify matches on each variable
      - Define a combination method in order to synthesize the variable matches
      - Define a matching rule in order to separate out good candidates
      The retrieval strategy is tested against the control sample.
  • 18. 2.1 – THE CLASSIFICATION RULES
      Quantitative variables: crisp rule based on a cut-off value for the difference between the same variable in different databases (binary matches)
      Qualitative variables (textual information): transformation into quantitative variables by means of:
      - Text analysis functions
      - Text mining algorithms
      Then treatment as quantitative variables.
  • 19. QUANTITATIVE VARIABLES
      $x_i$ = base bid, winning percentage, year of procurement, implementing body (*)
      (*) Implementing bodies match if their strings are identical.
      Safety-plan cost is discarded as it is used to build the control sample.
      Matching rule:
      $$\mathrm{MATCH}_i = \begin{cases} 1 & \text{if } \left| x_i^{AVLP} - x_i^{CFT} \right| \le d \\ 0 & \text{if } \left| x_i^{AVLP} - x_i^{CFT} \right| > d \end{cases}$$
      The default cut-off value $d$ is zero.
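
As an illustration of the crisp rule on this slide, the sketch below returns a binary flag per variable. Function and argument names are hypothetical; the only facts taken from the slide are the absolute-difference test against a cut-off that defaults to zero, and the exact string comparison for implementing bodies.

```python
def crisp_match(x_avlp, x_cft, d=0.0):
    """Return 1 if the two values differ by at most the cut-off d, else 0."""
    return 1 if abs(x_avlp - x_cft) <= d else 0

def body_match(name_avlp, name_cft):
    """Implementing bodies match only if their strings are identical."""
    return 1 if name_avlp == name_cft else 0

# With the default cut-off (zero) only identical values match.
same_bid = crisp_match(150000.0, 150000.0)
diff_pct = crisp_match(12.5, 12.3)
# A wider cut-off tolerates small discrepancies between the repositories.
near_pct = crisp_match(12.5, 12.3, d=0.5)
same_body = body_match("COMUNE DI PERUGIA", "COMUNE DI PERUGIA")
print(same_bid, diff_pct, near_pct, same_body)
```
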
  • 20. CONTROL SUBSAMPLE AND QUANTITATIVE VARIABLES MATCHES
      For subsequent textual analysis it is easier to use a subset of the control sample.
      51 candidate matches for Municipalities in Umbria:
      - 42 matches
      - 1 no match
      - 8 no match (Umbria AVLP misclassified)
      The Cartesian product of all possible combinations has 51 x 43 = 2,193 records.
      Matches in the control subsample:
      Variable            | Correct matches | % correct matches | Extra matches | % extra matches
      Base bid            |  7              | 16.7%             |   0           |  0.0%
      Winning percentage  | 28              | 66.7%             |  44           |  2.0%
      Year of procurement | 39              | 92.9%             | 707           | 32.9%
      Implementing body   | 40              | 95.2%             |  96           |  4.5%
  • 22. TEXT MINING PROCESS Project descriptions from both sets are appended into a single dataset which is fed into the Text Miner node.
  • 23. TEXT MINING – DIMENSION REDUCTION
      Roll-up terms: the initial number of roll-up terms is set equal to the number of terms appearing in more than one title. All other terms are dropped.
      Singular value decomposition: the number of dimensions given by the roll-up terms is reduced. The new space has a lower dimension and can be handled more easily.
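
The term selection described on this slide can be sketched in a few lines. This is an assumption-laden illustration: whitespace tokenisation and the helper names are hypothetical, and the Text Miner node's own handling of terms (stemming, stop lists, weighting) is not described in the slides. The sketch covers only the selection of terms that appear in more than one title and the resulting title-by-term count matrix; the singular value decomposition would then be applied to that matrix.

```python
from collections import Counter

def tokenize(title):
    """Lower-case whitespace tokenisation (a simplifying assumption)."""
    return title.lower().split()

def rollup_terms(titles):
    """Keep the terms that occur in more than one title; drop all others."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)))  # count each term once per title
    return sorted(t for t, n in doc_freq.items() if n > 1)

def term_matrix(titles, terms):
    """Title-by-term count matrix restricted to the selected terms."""
    index = {t: j for j, t in enumerate(terms)}
    rows = []
    for title in titles:
        row = [0] * len(terms)
        for tok in tokenize(title):
            if tok in index:
                row[index[tok]] += 1
        rows.append(row)
    return rows

titles = [
    "road maintenance via roma",
    "maintenance of municipal road",
    "new school building",
]
terms = rollup_terms(titles)
matrix = term_matrix(titles, terms)
print(terms)
print(matrix)
```

In this toy example only `maintenance` and `road` occur in more than one title, so the third title is left with an all-zero row.
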
  • 24. TEXT MINING – SVD DISTANCE The distance in the new space is computed for all pairs of projects from the two repositories. The optimal cut-off value that discriminates matches from no-matches is obtained with a binary tree.
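
One way to picture the cut-off search is a tree with a single split on the distance. The sketch below is an assumption, not the tree procedure used in the presentation: it tries the midpoints between consecutive distinct distances and keeps the threshold with the fewest misclassified pairs, whereas tree software usually optimises an impurity criterion. Data and names are made up.

```python
def best_cutoff(distances, labels):
    """Single-split search for the rule: predict MATCH if distance <= cutoff.

    `labels` holds 1 for known matches and 0 for known no-matches.
    Returns (cutoff, number_of_misclassified_pairs).
    """
    values = sorted(set(distances))
    # Candidate thresholds: midpoints between consecutive distinct distances.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_err = None, None
    for t in candidates:
        err = sum((d <= t) != bool(y) for d, y in zip(distances, labels))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy control sample: small distances are mostly matches.
distances = [0.05, 0.10, 0.20, 0.60, 0.70, 0.90]
labels = [1, 1, 1, 0, 1, 0]
cutoff, errors = best_cutoff(distances, labels)
print(cutoff, errors)
```

Ties are resolved in favour of the smaller threshold, which is the more conservative choice when extra matches are costly.
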
  • 25. MATCHES IN THE CONTROL SUBSAMPLE
      The best result is that of the SVD on targeted roll-up terms. This can be improved further by also counting as a match the pair with the minimum SVD distance for each project (42 correct matches, 11 extra matches).
      Matches in the control subsample:
      Difference measure      | Correct matches | % correct matches | Extra matches | % extra matches
      Raw SVD                 | 22              | 52.4%             | 1             | 0.0%
      Raw Roll-up             | 31              | 73.8%             | 5             | 0.2%
      Spedis                  | 34              | 81.0%             | 3             | 0.1%
      Compged                 | 28              | 66.7%             | 2             | 0.1%
      Complev                 | 26              | 61.9%             | 6             | 0.3%
      Targeted Roll-up        | 37              | 88.1%             | 5             | 0.2%
      SVD on targeted roll-up | 39              | 92.9%             | 6             | 0.3%
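
The names Spedis, Compged and Complev appear to refer to the SAS string-comparison functions SPEDIS, COMPGED and COMPLEV; this is an assumption, as the slide does not say so. COMPLEV is documented as returning the Levenshtein edit distance, and the sketch below implements the plain unit-cost Levenshtein distance to show what such a difference measure computes. It does not reproduce the options or cut-offs of the SAS function.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

d_classic = levenshtein("kitten", "sitting")
d_titles = levenshtein("scuola elementare", "scuola elementari")
print(d_classic, d_titles)
```

Two project descriptions that differ by a single character get distance 1, so a small cut-off on this measure plays the same role as the cut-off `d` for quantitative variables.
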
  • 26. 2.2 – THE COMBINATION METHOD / 2.3 – THE DISCRIMINATING RULE
      Combination method: sum of the binary matching flags for all the variables
      Other methods: weighted sum, logistic model
      Discriminating rule:
      - Strict match if the number of matching variables is ≥ 3
      - Loose match if the number of matching variables is ≥ 2
      Other rules: conditions on single variables
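
The combination method and the discriminating rule can be sketched together as below. The thresholds 3 and 2 come from the slide; the function name, the variable names in the dictionary and the labels are hypothetical. Note that every strict match also satisfies the loose condition; the sketch reports the most specific label.

```python
def classify(flags, strict_min=3, loose_min=2):
    """Sum the binary flags and apply the strict/loose thresholds."""
    score = sum(flags.values())
    if score >= strict_min:
        return "STRICT"
    if score >= loose_min:
        return "LOOSE"
    return "NO_MATCH"

# Binary flags for one candidate pair, one per variable (illustrative values).
pair_flags = {
    "base_bid": 0,
    "winning_percentage": 1,
    "year_of_procurement": 1,
    "implementing_body": 1,
    "description": 0,
}
label = classify(pair_flags)
print(label)
```
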
  • 27. CONTROL SAMPLE RESULTS
      [Figure: bar chart of the number of correct matches by number of matching variables (bar labels 1, 15, 21, 5), with the loose and strict cut-off values marked]
      With textual info the proportion of false positives decreases by up to 58 percentage points.
      Number of matching variables with (without) textual info | Correct matches | % correct matches | Extra matches with textual info | Extra matches without textual info | % false positives with textual info | % false positives without textual info
      4 (3)                                                    | 26              | 61.9%             |  0                              |   0                                |  0.0%                               |  0.0%
      3 (2)                                                    | 41              | 97.6%             |  1                              |  63                                |  2.4%                               | 60.6%
      2 (1)                                                    | 42              | 100.0%            | 67                              | 784                                | 61.5%                               | 94.9%
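
The false-positive percentages on this slide are consistent with dividing the extra matches by all retrieved matches (correct plus extra). That formula is inferred from the figures rather than stated on the slide; the short check below recomputes the reported values under that assumption.

```python
def false_positive_share(correct, extra):
    """Percentage of retrieved matches that are extra (not in the control sample)."""
    total = correct + extra
    return 0.0 if total == 0 else 100.0 * extra / total

# Row "3 (2)": 41 correct matches, 1 extra with and 63 extra without textual info.
with_text = round(false_positive_share(41, 1), 1)
without_text = round(false_positive_share(41, 63), 1)
reduction = round(without_text - with_text, 1)
print(with_text, without_text, reduction)
```

The difference for that row is about 58 percentage points, which is the reduction quoted on the slide.
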
  • 28. 3 – THE OPTIMAL STRATIFICATION Stratification variables: type of implementing body, region, year of procurement. The number of potential matches grows like n². Municipalities alone, which cover more than half of the projects, would lead to 12,200 million potential matches. Regional stratification reduces the total to 968 million (regional variation in the range 0.4 to 400 million). Further stratification may be needed in denser regions (a compressed dataset of 400 million records requires nearly 15 GB for a match on one quantitative variable). [Figure: number of matches]
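
The effect of stratification on the number of candidate pairs can be sketched as follows: without strata every AVLP record is paired with every CFT record, while with strata only records sharing the same stratum values are paired. Record layout and field names are hypothetical, and the toy figures are not taken from the repositories.

```python
from collections import Counter

def potential_matches(avlp, cft, keys=()):
    """Count candidate pairs when records are paired only within a stratum.

    With keys=() all records fall into a single stratum, which gives the
    full Cartesian product len(avlp) * len(cft).
    """
    a = Counter(tuple(r[k] for k in keys) for r in avlp)
    c = Counter(tuple(r[k] for k in keys) for r in cft)
    # A Counter returns 0 for strata that are absent from the other repository.
    return sum(n * c[stratum] for stratum, n in a.items())

avlp = [{"region": "Umbria"}, {"region": "Umbria"},
        {"region": "Marche"}, {"region": "Marche"}]
cft = [{"region": "Umbria"}, {"region": "Umbria"}, {"region": "Umbria"},
       {"region": "Marche"}, {"region": "Marche"}]

unstratified = potential_matches(avlp, cft)
by_region = potential_matches(avlp, cft, keys=("region",))
print(unstratified, by_region)
```

Here the full product has 4 x 5 = 20 pairs, while pairing within regions leaves 2 x 3 + 2 x 2 = 10; adding further keys such as year of procurement shrinks the count again.
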
  • 29. 4 – APPLICATION TO STRATIFIED TEST DATA
      Type of implementing body: Municipalities
      Region: Marche
      Year of procurement: 2004
      Matches in control sample: 2
      AVLP projects: 247
      CFT projects: 506
      Automatic strict matches: 128 (52%)
      Automatic loose matches: 160 (65%)
  • 30. USAGE OF MATCHING RESULTS
      - Replacement of missing data
      - Correction of mistakes
      - Identification of potential evaders
      - Data integration
  • 31. FURTHER DEVELOPMENT
      - Textual information preprocessing: built-in macros; exploratory analyses with Text Miner
      - Use of fuzzy classification rules (intervals and levels of uncertainty, instead of cut-off values)
      - Error handling for stratification variables
      - Tie-breaks: multiple records in CFT referring to the same tender; multiple tenders in CFT referring to the same project; very similar records