Machine Learning on public procurement open data
The ANAC case study
ANAC Dataset
- 4 million public procurements extracted for the year 2017
- 14,000 Public Administrations (PA)
- 500,000 private companies
- We integrate the data with other sources:
  - Indice PA: contains detailed information about PAs
  - Open Consip: contains detailed information about 100,000 companies
ANAC Dataset
- cig: ID of a public procurement (very noisy)
- Info about the PA: fiscal code, name
- Info about the winning company and the other competitors: fiscal code, name (very noisy)
- Info about a public procurement: title, type, cost/revenue in euro, starting date, ending date
What information can we extract?
Who are the top companies in terms of public procurements won and earnings?
How much do their revenues depend on the public sector?

What information can we extract?
Given a company, show detailed information about the public procurements it has won,
the types of PA involved, and the regions involved.
Existing Projects
A few existing projects already show this information.
Anything else?
From explicit relationships
The dataset contains only two types of relationships:
- “Winner”
- “Competitors”
To hidden relationships
[Figure labels: Public Administrations, Companies, Revenues]
Idea: extract hidden relationships among Public Administrations and
private companies using unstructured information such as public procurement titles.
Goal:
- Identify PAs with the same needs (i.e. PAs that require similar services);
- Identify indirect competitors among companies;
- Help a PA find companies (e.g. find all companies that sell wine).
How to represent PA and Companies?
Synthetic documents: for each PA and each company, we concatenate its public
procurement titles into a single synthetic document.
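fornitura di carburante per automezzi comunali
  (EN: supply of fuel for municipal vehicles)
fornitura di carburante per automezzi comunali
  (EN: supply of fuel for municipal vehicles)
impegno di spesa per fornitura carburante per i mezzi in dotazione alla polizia locale
  (EN: spending commitment for the supply of fuel for the vehicles assigned to the local police)
fornitura carburante per mezzi in dotazione al gruppo comunale volontari protezione civileaib mediante adesione alla convenzione consip spa assunzione relativo impegno di spesa per il periodo 0101 2017 al 31122017
  (EN: supply of fuel for the vehicles assigned to the municipal civil protection volunteer group, by joining the Consip SpA framework agreement; assumption of the related spending commitment for the period 0101 2017 to 31122017)
fornitura carburante comune di testico determina n 692017
  (EN: fuel supply, municipality of Testico, determination no. 692017)
fornitura carburante comuni di stellanello determina n 69 del 06032017
  (EN: fuel supply, municipalities of Stellanello, determination no. 69 of 06032017)
acquisto carburante panda e moto
  (EN: purchase of fuel for Panda and motorcycle)
fornitura carburante per autotrazione mediante adesione alla convenzione consip spa tramite buoni carburante per il periodo 010931122017
  (EN: supply of motor fuel by joining the Consip SpA framework agreement, via fuel vouchers, for the period 010931122017)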
Synthetic doc -> Multidimensional representation
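The concatenation step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record layout, the entity identifiers ("PA-001", "PA-002") and the function name build_synthetic_docs are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records: (entity identifier, procurement title).
# The identifiers are invented for illustration; real data would use fiscal codes.
records = [
    ("PA-001", "fornitura di carburante per automezzi comunali"),
    ("PA-001", "acquisto carburante panda e moto"),
    ("PA-002", "fornitura sale per disgelo stradale"),
]


def build_synthetic_docs(rows):
    """Concatenate all titles of each entity into one synthetic document."""
    titles_by_entity = defaultdict(list)
    for entity_id, title in rows:
        titles_by_entity[entity_id].append(title.strip().lower())
    # Join the titles of each entity with a single space.
    return {eid: " ".join(titles) for eid, titles in titles_by_entity.items()}


synthetic_docs = build_synthetic_docs(records)
print(synthetic_docs["PA-001"])
```

The same function would apply to companies by grouping on the company identifier instead of the PA identifier.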
Analyze Unstructured Information
Each document is represented in a continuous vector space.
We consider three different representations:
- Vector Space Model
- Embedding Centroid
- Weighted Embedding
Vector Space Model
Each document is represented as a vector with the size of the vocabulary, where
each element gives the frequency of a word in the document or some weight
derived from the frequency (tf-idf).
d1: Fornitura sale per disgelo stradale (EN: supply of salt for road de-icing)
d2: Fornitura sale uso disgelo (EN: supply of salt for de-icing use)
cosine_sim(d1, d2) = 0.79
Vector Space Model
d1: Fornitura sale per disgelo stradale (EN: supply of salt for road de-icing)
d3: Impegno di spesa per acquisto sale da disgelo per il servizio di sgombero neve (EN: spending commitment for the purchase of de-icing salt for the snow clearing service)
cosine_sim(d1, d3) = 0.33
VSM methods perform poorly on short documents (e.g. titles), simply because
word overlap is minimal even for related documents.
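A minimal sketch of the Vector Space Model comparison, using raw term frequencies and only the three titles from the slides. It is an assumption-laden toy: the slides' values (0.79 and 0.33) presumably come from TF-IDF weights computed over the whole corpus, so the numbers printed here differ. The function names tf_vector and cosine_sim are illustrative.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-frequency vector: word -> count."""
    return Counter(text.lower().split())


def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = "Fornitura sale per disgelo stradale"
d2 = "Fornitura sale uso disgelo"
d3 = "Impegno di spesa per acquisto sale da disgelo per il servizio di sgombero neve"

sim_12 = cosine_sim(tf_vector(d1), tf_vector(d2))
sim_13 = cosine_sim(tf_vector(d1), tf_vector(d3))
print(round(sim_12, 2), round(sim_13, 2))  # prints: 0.67 0.42
```

Even in this toy setting the longer title scores lower against d1, although both are about de-icing salt: only three of its word types overlap with d1.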
Embedding centroids
“You shall know a word by the company it keeps” (Firth 1957)
- Word embeddings are powerful representations that capture a great deal of
contextual information
- Intuition: the context represents the semantics (e.g. "smart" and "intelligent"
would have similar contexts)
A medical doctor is a person who uses medicine to treat illness and injuries.
Some medical doctors only work on certain diseases or injuries.
Medical doctors examine, diagnose and treat patients.
The context is used to represent "doctor"/"doctors".
Word embedding Example
Embedding centroids
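Each document is represented as the centroid of all its word vectors.
Example:
d1: Impegno spesa per acquisto sale da disgelo per servizio di sgombero neve (EN: spending commitment for the purchase of de-icing salt for the snow clearing service)
d2: Fornitura sale per disgelo stradale (EN: supply of salt for road de-icing)
d3: Fornitura per ufficio (EN: office supplies)
[Figure: word-embedding plot with the words impegno, spesa, acquisto, disgelo, neve, servizio, sale, fornitura, stradale, cancelleria, cartoleria, ufficio and the document centroids d1, d2, d3]
cosine_sim(d1, d2) = 0.80
cosine_sim(d1, d3) = 0.30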
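The centroid representation can be sketched as below. The 3-dimensional word vectors are invented purely for illustration (real embeddings would be trained on a corpus and have many more dimensions), so the similarities do not reproduce the 0.80 and 0.30 reported on the slide. Words without a vector are simply skipped, which is an assumption of this sketch.

```python
import math

# Toy word vectors, made up for this example only.
toy_vectors = {
    "fornitura": [0.2, 0.1, 0.7],
    "sale": [0.9, 0.1, 0.0],
    "disgelo": [0.8, 0.3, 0.1],
    "stradale": [0.6, 0.4, 0.1],
    "neve": [0.9, 0.2, 0.0],
    "sgombero": [0.7, 0.3, 0.1],
    "ufficio": [0.0, 0.9, 0.3],
}


def centroid(text, vectors):
    """Average the vectors of the words of `text` that have an embedding."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


def cosine_sim(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


c1 = centroid("Impegno spesa per acquisto sale da disgelo per servizio di sgombero neve", toy_vectors)
c2 = centroid("Fornitura sale per disgelo stradale", toy_vectors)
c3 = centroid("Fornitura per ufficio", toy_vectors)

sim_12 = cosine_sim(c1, c2)
sim_13 = cosine_sim(c1, c3)
print(sim_12 > sim_13)  # the two de-icing titles end up closer than d1 and d3
```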
Embedding centroids
The intuition is that word embeddings help with short technical text such as
titles or abstracts, where exact word overlap is often not enough.
Using embeddings on long text adds some degree of noise, and computing a
centroid may lead to some information loss.
[Figure: Embedding for a boiler seller (cf: 02254030204); plotted words: modificato, determinazione, sostituzione, valvola, caldaia, materna, comma, manutenzione, palestra, alloggio, idrauliche, centralina, scuola, comunale, via, contrarre, pavimento, presso, lettera]
Weighted Embedding
We can compute a weighted centroid of the word vectors, using each word's VSM
weight (i.e. TF-IDF) as its coefficient.
[Figure: Embedding for a boiler seller (cf: 02254030204); plotted words: modificato, determinazione, sostituzione, valvola, caldaia, materna, comma, manutenzione, palestra, alloggio, idrauliche, centralina, scuola, comunale, via, contrarre, pavimento, presso]
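A possible sketch of the TF-IDF-weighted centroid. Everything concrete here is an assumption for illustration: the tiny three-document corpus, the 2-dimensional toy vectors, the smoothed idf variant, and the function names tfidf_weights and weighted_centroid. The point is only that rarer, more specific words (e.g. "valvola") pull the centroid more than words that occur in every document (e.g. "comunale").

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight per term of one document, with a smoothed idf."""
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for other in corpus if term in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = count * idf
    return weights


def weighted_centroid(weights, vectors):
    """Weighted average of word vectors; terms without a vector are skipped."""
    terms = [t for t in weights if t in vectors]
    total = sum(weights[t] for t in terms)
    if total == 0:
        return None
    dim = len(vectors[terms[0]])
    return [sum(weights[t] * vectors[t][i] for t in terms) / total for i in range(dim)]


# Toy corpus of tokenized synthetic documents (invented for this example).
corpus = [
    ["sostituzione", "valvola", "caldaia", "scuola", "comunale"],
    ["manutenzione", "caldaia", "palestra", "comunale"],
    ["determinazione", "contrarre", "scuola", "comunale"],
]

# Toy 2-d vectors: first axis ~ "heating equipment", second ~ "generic PA wording".
toy_vectors = {
    "valvola": [0.9, 0.1],
    "caldaia": [1.0, 0.0],
    "scuola": [0.1, 0.9],
    "comunale": [0.0, 1.0],
}

weights = tfidf_weights(corpus[0], corpus)
wc = weighted_centroid(weights, toy_vectors)
print(wc)
```

With these toy numbers the unweighted centroid of the four known words has a first coordinate of exactly 0.5, while the weighted centroid moves toward the heating-related words.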
Example
Common top documents
Top documents for VSM
There are 37 documents with cosine similarity greater than 0.4
Top documents for Weighted Embedding
There are more than 1000 documents with cosine similarity greater than 0.7
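Counts like the ones above could be obtained by scoring every document against a query document and keeping those above a threshold. The sketch below assumes dense document vectors held in a dictionary; the vectors, identifiers and the function name similar_documents are hypothetical.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_documents(query_vec, doc_vecs, threshold):
    """Return (doc_id, similarity) pairs above `threshold`, most similar first."""
    scored = [(doc_id, cosine_sim(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    hits = [(doc_id, sim) for doc_id, sim in scored if sim > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy document vectors, invented for illustration.
doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query = [1.0, 0.0]

hits = similar_documents(query, doc_vecs, threshold=0.4)
print(len(hits), [doc_id for doc_id, _ in hits])
```

Because embedding-based similarities tend to be higher across the board than VSM similarities, the same threshold is not directly comparable between representations, which is consistent with the different thresholds (0.4 and 0.7) used on the two slides.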
Comparison of top-k similar documents
- VSM vs Emb. Centroid
- VSM vs Weighted Emb.
- Emb. Centroid vs Weighted Emb.
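One simple way to compare two representations pairwise is the fraction of shared documents among their top-k results. The slides do not say which measure was plotted, so overlap@k is an assumption of this sketch, as are the rankings and the function name top_k_overlap.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of documents shared by the top-k of two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical rankings of document ids returned by two representations
# for the same query document, most similar first.
vsm_ranking = ["d7", "d2", "d9", "d4"]
centroid_ranking = ["d2", "d5", "d7", "d1"]

overlap_at_3 = top_k_overlap(vsm_ranking, centroid_ranking, k=3)
print(overlap_at_3)
```

A value near 1 would mean the two representations retrieve almost the same neighbours; a value near 0 would mean they capture different notions of similarity.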
Demo
Fork the source code!
Fabiana Lanotte
Data Engineer
fabiana@teamdigitale.governo.it
Linkedin: fabiana-lanotte
Github: fabiana001
