Machine Learning on public procurement open data
The ANAC case study
ANAC Dataset
- 4 million public procurements extracted for the year 2017
- 14,000 Public Administrations (PA)
- 500,000 private companies
- We integrate the data with other sources:
  - Indice PA: contains detailed information about PAs
  - Open Consip: contains detailed information about 100,000 companies
ANAC Dataset
- cig: ID of a public procurement (very noisy)
- Info about the PA: fiscal code, name
- Info about the winning company and the other competitors: fiscal code, name (very noisy)
- Info about a public procurement: title, type, cost/revenue in euros, start date, end date
What information can we extract?
Who are the top companies in terms of public procurements won and earnings?
How much do their revenues depend on the public sector?
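A minimal sketch of how such a ranking could be computed. The record layout `(cig, winner fiscal code, amount in euros)` and the sample values are hypothetical, chosen only to illustrate the aggregation; they are not taken from the ANAC data.

```python
from collections import defaultdict

# Hypothetical award records: (cig, winner fiscal code, amount in euros)
awards = [
    ("CIG001", "01234567890", 12000.0),
    ("CIG002", "01234567890", 8000.0),
    ("CIG003", "09876543210", 50000.0),
]

def top_companies(awards, k=10):
    """Rank companies by total earnings, also counting procurements won."""
    stats = defaultdict(lambda: {"won": 0, "earnings": 0.0})
    for _cig, company, amount in awards:
        stats[company]["won"] += 1
        stats[company]["earnings"] += amount
    # Sort by total earnings, highest first, and keep the first k companies
    ranked = sorted(stats.items(), key=lambda kv: kv[1]["earnings"], reverse=True)
    return ranked[:k]

ranking = top_companies(awards)
```

Sorting by `"won"` instead of `"earnings"` would give the ranking by number of procurements won.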

What information can we extract?
Given a company, show detailed information about the public procurements it won, the types of PA involved, and the regions involved.
Existing Projects
Few existing projects show this information.
Anything else?
From explicit relationships
The dataset contains only two types of relationships:
- “Winner”
- “Competitors”
To hidden relationships
[Figure labels: Revenues, Public Administrations, Companies]
Idea: extract hidden relationships among Public Administrations and private companies using unstructured information such as public procurement titles.
Goals:
- Identify PAs with the same needs (i.e. that require similar services);
- Identify indirect competitors among companies;
- Help a PA find companies (e.g. find all companies that sell wine).
How to represent PAs and companies?
Synthetic documents: for each PA and each company, we concatenate the titles of its public procurements to obtain a synthetic document.
Example (Italian procurement titles, all concerning fuel supply):
- fornitura di carburante per automezzi comunali
- fornitura di carburante per automezzi comunali
- impegno di spesa per fornitura carburante per i mezzi in dotazione alla polizia locale
- fornitura carburante per mezzi in dotazione al gruppo comunale volontari protezione civileaib mediante adesione alla convenzione consip spa assunzione relativo impegno di spesa per il periodo 0101 2017 al 31122017
- fornitura carburante comune di testico determina n 692017
- fornitura carburante comuni di stellanello determina n 69 del 06032017
- acquisto carburante panda e moto
- fornitura carburante per autotrazione mediante adesione alla convenzione consip spa tramite buoni carburante per il periodo 010931122017
Synthetic doc → Multidimensional representation
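The construction of synthetic documents can be sketched as a simple group-and-concatenate step. The entity identifiers (`PA_1`, `PA_2`) and the lowercasing step are assumptions made for illustration; the actual preprocessing used in the project is not described in the slides.

```python
from collections import defaultdict

# Hypothetical (entity id, procurement title) pairs
rows = [
    ("PA_1", "fornitura di carburante per automezzi comunali"),
    ("PA_1", "acquisto carburante panda e moto"),
    ("PA_2", "fornitura sale per disgelo stradale"),
]

def build_synthetic_docs(rows):
    """Concatenate all titles of each entity into one synthetic document."""
    titles = defaultdict(list)
    for entity, title in rows:
        # Light normalization (an assumption): lowercase and trim whitespace
        titles[entity].append(title.lower().strip())
    return {entity: " ".join(parts) for entity, parts in titles.items()}

synthetic_docs = build_synthetic_docs(rows)
```

The same function applies to companies: it only needs pairs keyed by the company's fiscal code instead of the PA's.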
Analyze Unstructured Information
Each document is represented in a continuous vector space.
We consider three different representations:
- Vector Space Model
- Embedding Centroid
- Weighted Embedding
Vector Space Model
Each document is represented as a vector with the size of the vocabulary, where
each element gives the frequency of a word in the document or some weight
derived from the frequency (tf-idf).
d1: Fornitura sale per disgelo stradale (supply of road de-icing salt)
d2: Fornitura sale uso disgelo (supply of salt for de-icing)
cosine_sim(d1,d2) = 0.79
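A minimal sketch of the Vector Space Model on these two titles, using raw term counts as weights. This is a simplification: the value on the slide presumably comes from weights computed over the whole corpus (e.g. tf-idf), so the score below (about 0.67) is not expected to match 0.79.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector: word -> count."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = tf_vector("Fornitura sale per disgelo stradale")
d2 = tf_vector("Fornitura sale uso disgelo")
score = cosine_sim(d1, d2)  # 3 shared words out of 5 and 4
```

Only the shared words (fornitura, sale, disgelo) contribute to the dot product, which is why the score depends entirely on exact word overlap.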
Vector Space Model
Each document is represented as a vector with the size of the vocabulary, where
each element gives the frequency of a word in the document or some weight
derived from the frequency.
Impegno di spesa per acquisto sale da disgelo per il servizio di sgombero neve (expenditure commitment for the purchase of de-icing salt for the snow-clearing service)
Fornitura sale per disgelo stradale (supply of road de-icing salt)
cosine_sim = 0.33 (between the two titles above)
VSM methods perform poorly on short documents (e.g. titles), because word overlap is minimal even for related documents.
Embedding centroids
“You shall know a word by the company it keeps” (Firth 1957)
- Word embeddings are powerful representations that capture a great deal of contextual information
- Intuition: the context represents the semantics (e.g. “smart” and “intelligent” appear in similar contexts)
Example: the context is used to represent doctor/doctors:
- A medical doctor is a person who uses medicine to treat illness and injuries
- Some medical doctors only work on certain diseases or injuries
- Medical doctors examine, diagnose and treat patients
Word embedding Example
Embedding centroids
Each document is represented as the centroid of all its word vectors.
Example:
d1: Impegno spesa per acquisto sale da disgelo per servizio di sgombero neve
d2: Fornitura sale per disgelo stradale
d3: Fornitura per ufficio
[Figure: word vectors for impegno, spesa, acquisto, disgelo, neve, servizio, sale, fornitura, stradale, cancelleria, cartoleria, ufficio, together with the document centroids d1, d2, d3]
cosine_sim(d1,d2) = 0.80
cosine_sim(d1,d3) = 0.30
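The centroid representation can be sketched as follows. The 3-dimensional word vectors are made up for illustration (real embeddings would be trained on a corpus and have many more dimensions), and skipping out-of-vocabulary words is an assumption, so the scores do not reproduce the values above.

```python
import math

# Hypothetical toy embeddings; real ones would come from a trained model
embeddings = {
    "fornitura": [0.9, 0.1, 0.0],
    "sale": [0.1, 0.9, 0.1],
    "disgelo": [0.0, 0.8, 0.3],
    "neve": [0.1, 0.7, 0.4],
    "ufficio": [0.8, 0.0, 0.6],
}

def centroid(text, embeddings):
    """Average the vectors of the words found in the embedding table."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None  # no known word: the document cannot be represented
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_sim(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

c1 = centroid("Impegno spesa per acquisto sale da disgelo per servizio di sgombero neve", embeddings)
c2 = centroid("Fornitura sale per disgelo stradale", embeddings)
c3 = centroid("Fornitura per ufficio", embeddings)
sim_12 = cosine_sim(c1, c2)
sim_13 = cosine_sim(c1, c3)
```

With these toy vectors, d1 and d2 end up closer than d1 and d3 even though d1 and d2 share few exact words, which is the behaviour the slide illustrates.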
Embedding centroids
The intuition is that word embeddings help on short technical texts such as titles or abstracts, where exact word overlap is often not enough.
Using embeddings on long texts adds some noise, and computing a centroid may lose some information.
Embedding for a boiler seller (fiscal code: 02254030204)
[Figure: word vectors for modificato, determinazione, sostituzione, valvola, caldaia, materna, comma, manutenzione, palestra, alloggio, idrauliche, centralina, scuola, comunale, via, contrarre, pavimento, presso, lettera]
Weighted Embedding
We may compute a weighted centroid using the VSM weights (i.e. TF-IDF) as coefficients.
Embedding for a boiler seller (fiscal code: 02254030204)
[Figure: word vectors for modificato, determinazione, sostituzione, valvola, caldaia, materna, comma, manutenzione, palestra, alloggio, idrauliche, centralina, scuola, comunale, via, contrarre, pavimento, presso]
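A minimal sketch of a TF-IDF weighted centroid. The toy corpus, the 2-dimensional embeddings, the `log(N/df) + 1` IDF variant and the normalization by the sum of the weights are all assumptions made for illustration; the slides do not specify these details.

```python
import math
from collections import Counter

# Hypothetical toy corpus of synthetic documents and toy embeddings
docs = [
    "sostituzione valvola caldaia scuola comunale",
    "manutenzione caldaia palestra comunale",
    "fornitura carburante comunale",
]
embeddings = {"caldaia": [1.0, 0.0], "comunale": [0.0, 1.0]}

def idf_table(docs):
    """IDF per term, using log(N / df) + 1 (one common variant)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

def weighted_centroid(doc, embeddings, idf):
    """Average of word vectors, each weighted by its tf * idf."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in Counter(doc.split()).items():
        if term not in embeddings or term not in idf:
            continue  # skip words without a vector or an IDF value
        weight = count * idf[term]
        for i, x in enumerate(embeddings[term]):
            total[i] += weight * x
        weight_sum += weight
    if weight_sum == 0:
        return None
    return [x / weight_sum for x in total]

idf = idf_table(docs)
wc = weighted_centroid(docs[0], embeddings, idf)
```

In this toy corpus "comunale" appears in every document, so it receives a lower weight than "caldaia" and the centroid moves toward the more distinctive word.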
Example
Common top documents
Top documents for VSM
There are 37 documents with cosine similarity greater than 0.4
Top documents for Weighted Embedding
There are more than 1000 documents with cosine similarity greater than 0.7
Comparison of top-k similar documents
- VSM vs Emb. Centroid
- VSM vs Weighted Emb.
- Emb. Centroid vs Weighted Emb.
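One possible way to compare two representations is to measure how many of the top-k most similar documents they have in common. The sketch below assumes this overlap measure and uses hypothetical similarity scores; the slides do not state which measure was actually plotted.

```python
def top_k(scores, k):
    """Return the k document ids with the highest similarity score."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:k]

def overlap_at_k(ranking_a, ranking_b):
    """Fraction of documents shared by two top-k lists of equal length."""
    if not ranking_a:
        return 0.0
    return len(set(ranking_a) & set(ranking_b)) / len(ranking_a)

# Hypothetical similarity scores of four documents to the same query document
vsm_scores = {"A": 0.90, "B": 0.50, "C": 0.40, "D": 0.10}
emb_scores = {"A": 0.95, "C": 0.80, "D": 0.75, "B": 0.20}

overlap = overlap_at_k(top_k(vsm_scores, 2), top_k(emb_scores, 2))
```

Here the two methods agree on document A only, so the overlap at k = 2 is 0.5.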
Demo
Fork the source code!
Fabiana Lanotte
Data Engineer
fabiana@teamdigitale.governo.it
Linkedin: fabiana-lanotte
Github: fabiana001
