Pi school-dli-presentation de nobili

Pi Campus invests in applied AI startups
49 investments in 5 years
The deal: €50-500K for a 1-10% share
It is a seed-stage venture fund and a startup district
50% Italy
25% Europe
25% California
Pi Campus

Several times a year, we host a batch of the best engineers from all
over the world and turn them into AI specialists.
They apply their new skills to an industry project provided either by
their own employer or by world-leading tech companies such as
Google, Facebook and Amazon, as well as fast-growing startups.
School of Artificial Intelligence

● Merit First
Top developers get in for free, and those who transfer from abroad
will receive a travel and accommodation grant.
● Learn by doing
Minimal teaching. Desks and environment are organised to
support small project teams, agile co-development, and interaction
with mentors.
● Real world projects, no simulations
Our partners sponsor top developers to solve real challenges.
School of Artificial Intelligence

Some mentors from our pool
Qualified advisors for your project
Managing Director, Faculty, Advisors
Director of AI, Facebook
Principal Applied Scientist, Amazon

Challenge portfolio
● Amazon: The Watercolour World
● Amazon: speech science with MXNet
● Translated: spoken language identification
● Translated: content-based translator scoring
● Wanderio: e-commerce fraud detection
● Lamco: mapping with AI
● Vatican Secret Archives: Latin OCR
● Kingcom: Food influencers
● Defenx: ransomware preemptive detection
● PwC: visual document classification
● PwC: interpretable machine learning
● Amex: loyalty programme email campaigns
● MiBACT, Italian ministry for heritage: searching Italy's art
● Atomikad: in-image advertising
● Veneto region: understanding hospital files
● Soldo: info extraction from receipts
● Covisian: call centre performance
● Xriba: tax codes on invoices
● Translated: customer lifetime value prediction
● Translated: adword optimisation
● Cisco: the ML platform for networking
● BNL: Basel II operational risk prediction
● BNL: ID scanning
● Poste: visual walk-in customer profiling
● Cisco: analysing hierarchical network data
● Cisco: reinforcement learning for wifi channel selection
● Cisco: privacy challenges in ML
● Cisco: combating Twitterbots
● Cisco: reconstructing depth from 2D images
● Engie: electricity imbalance prediction
● Cloudcare: lightspeed chat suggestions
● Employerland: AI for job coaching
● Sorgenia: customer care chatbot
● Inreach: sourcing investment opportunities
● Enel: AWS cloud optimization
● Enel X Colombia: propensity modelling
● Enel: electricity load forecast for homes
● Enel Distribution: medium voltage line SCADA fault prediction
● Enel Distribution: extreme weather impact on power lines
● Enel: IT helpdesk ticket routing
● Consiglio Nazionale del Notariato: AI for tomorrow's notary public
● Cloudcare: virtual call center supervisor
● Global and Local: improve rural municipalities' access to funding
● FASI: personalised tender alerts
● European Space Agency: earth observation
● Pryiatech: heart beat detection on video
● Freeda Media: lipstick recommender
● Radio Dimensione Suono: news picker
● Octo: fuel tank monitor

Next session: 29 November 2021
Register on Pi School’s website: School of AI / Apply now
https://picampus-school.com/programme/school-of-ai/

Leveraging NLP to achieve environmental sustainability
In collaboration with the Joint Research Centre (JRC) of the European Union
#pischool
Francesco Cariaggi, @fcariaggi
Cristiano De Nobili, PhD, @denocris
Sébastien Bratières, @Seb_Bratieres

The larger issue
● EU must stay competitive in environmental capabilities
● JRC tasked with informing policy by analysis
● JRC Circular Economy and Industrial Leadership Unit
○ compile BREF: e.g. paper, slaughterhouses, ceramics, waste water, iron and steel production
○ data-driven tools from Economic Complexity discipline
● BUT: goods classifications are made for customs, not environmental
assessment!
Our solution: Geographically map capabilities, represented by patents, with
BREF as queries.

What do NLP and Environmental Sustainability have in common?
We built an Information Retrieval (IR) system based on Transformers that can
retrieve R&D-relevant patents.
OK, but what kind of patents?
At this stage of the project, JRC was interested in Industrial Pollution (Patents4IPPC).

AI for Sustainability
As AI scientists or engineers, we can make a difference to the world.
With this project at Pi School we are doing our bit.
“If you have to stand in front of a computer for hours,
make sure that there is a strong mission behind the screen.”

Scope of the project
Given a BREF* passage, find the most relevant patents
Most relevant patents
- Patent 1
- Patent 2
- ...
*BREFs (Best Available Techniques Reference documents) are the result of a long, detailed, state-of-the-art technical analysis
of the available techniques (consolidated and emerging) in the field of industrial pollution control.

Technical Challenges
The project might look like a simple text similarity task with BERT,
but this is not the case:
● Linguistic style mismatch between query (BREF) and response (patent);
● Moving from contextualized word embeddings to sentence embeddings for Semantic
Textual Similarity (STS);
● No labeled training data available (GS1* serves only as a test set);
● Huge response database (about 10-20 M patents).
*GS1 (Gold Standard 1) is a dataset composed of a few pairs of a BREF passage and a corresponding relevant patent.
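
To put the last challenge in numbers, here is a rough sketch of the memory needed to hold one dense vector per patent. The 1024-dimensional float32 embedding is an assumption (it is the hidden size of BERT-Large); the slides do not state the embedding size actually used.

```python
def index_size_gb(num_vectors: int, dim: int = 1024, bytes_per_value: int = 4) -> float:
    """Size in GB (10**9 bytes) of an uncompressed store of dense vectors.

    dim=1024 and float32 values are assumptions made for illustration only.
    """
    return num_vectors * dim * bytes_per_value / 1e9


if __name__ == "__main__":
    # The slide quotes a response database of about 10-20 M patents.
    for n in (10_000_000, 20_000_000):
        print(f"{n:,} patents -> {index_size_gb(n):.2f} GB")
```

Under these assumptions the raw vectors alone take tens of gigabytes, which is why an index that can be compressed, sharded or memory-mapped matters.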

Linguistic Style Mismatch
BREF Passage (less technical):
Reduction of the amount of oxygen available in the combustion zone to the minimum amount needed for complete combustion and
for minimising NOX generation. The technique is mainly based on the minimisation of air leakages in the furnace, careful control of
the air used for combustion and a modified design of the furnace combustion chamber.
Patent Abstract (technical, some words are omitted, redundant):
Burner assembly and method for combustion of gaseous of liquid fuel. The invention relates to a burner assembly (1) and a method for
combustion of gaseous or liquid fuel to heat an industrial furnace (9) having a combustion chamber (2), at least one main combustion air
inlet (3) for the supply of preheated combustion air (4) into the combustion chamber (2), a burner (5) with at least one fuel feed (7) and at
least one air feed (8) for supply of fuel and primary air into a the combustion chamber (2), wherein the burner (5) is positioned adjacent to a
combustion zone of the combustion chamber (2) such that the combustion air (4) flowing into the combustion chamber (2) through the
main combustion air inlet (3) is passing the burner (5) in the combustion zone and is then deflected such that the flow of preheated
combustion air and the smaller flows of fuel and primary air are flowing mainly in parallel from the burner (5) to the furnace (9), and a control
unit for controlling the supply of fuel and maybe primary air into the combustion chamber (2). The control unit is adapted to supply the fuel
and/or the primary air from the fuel and/or air feed (7, 8) into the combustion chamber (2) with an exit velocity higher than 150 m/s.

Measuring Linguistic Style Mismatch
Loss Function: 0.97 1.0 1.67 2.29 1.10

Solving Linguistic Style Mismatch
[Figure: labels "patents", "BREFs jargon", "EU patents"; Loss Function: 0.97 1.0 1.67 2.29 1.10]
Our solution: the original bert-4-patents checkpoint by Google, adaptively tuned with a Masked Language Model objective on BREF docs & Patstat.
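
For readers unfamiliar with Masked Language Model tuning, the sketch below shows how training examples are typically corrupted, following the recipe from the original BERT paper (select about 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged). The function name `mask_tokens`, the toy sentence and the toy vocabulary are hypothetical; the slides do not describe the exact masking settings used in the project.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for MLM training.

    Returns (corrupted, labels): labels[i] holds the original token at the
    positions the model must predict and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


if __name__ == "__main__":
    # Toy BREF-like sentence and toy vocabulary (both made up).
    sentence = "minimisation of air leakages in the furnace".split()
    toy_vocab = ["burner", "combustion", "fuel", "chamber"]
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

Running this kind of corruption over in-domain text (here, BREF documents and patents) is what lets the model adapt to the vocabulary and style of both sides without any labels.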

Geometry of BERT
From Contextualized Word Embedding to Sentence Embedding for STS
[Figure: the sentence “No planet B” embedded by BERT and by a non-contextual LM (W2V, GloVe), with one vector per word: No, planet, B.]
BERT embeddings do not live in a Euclidean space but in something more similar to a hyperbolic space. Here we cannot sum vectors or use cosine similarity to measure their distance. BERT is not a good sentence embedder.
Hewitt & Manning 2019, arXiv:1906.02715, arXiv:1909.00512.

This is related to the fact that MLM training is not optimized to treat all embedding dimensions equally.

Sentence BERT
A siamese network, when trained in a supervised way, is able to generate meaningful sentence embeddings
[Figure: siamese BERT encoders compare “No planet B” with “Save the planet” and with “Let’s have a spritz”; labels [1, 0, …].]
SentBERT: http://arxiv.org/abs/1908.10084
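
As a minimal sketch of the idea, the Sentence-BERT paper pools the token outputs of each encoder into one vector (mean pooling by default) and, for regression on similarity scores, trains on the squared error between the cosine similarity of the two pooled vectors and the gold score. The toy 2-dimensional token vectors below are made up and stand in for real BERT outputs; whether the project used exactly this objective is not stated in the slides.

```python
import math


def mean_pool(token_vectors):
    """Average a list of token vectors into a single sentence vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(tokens_a, tokens_b, gold_similarity):
    """Squared error between predicted cosine similarity and the gold label."""
    predicted = cosine(mean_pool(tokens_a), mean_pool(tokens_b))
    return (predicted - gold_similarity) ** 2


if __name__ == "__main__":
    # Made-up token vectors for three short sentences.
    no_planet_b = [[1.0, 0.0], [0.8, 0.2]]
    save_the_planet = [[0.9, 0.1], [0.7, 0.3]]
    have_a_spritz = [[0.0, 1.0], [0.1, 0.9]]
    print(pair_loss(no_planet_b, save_the_planet, 1.0))  # similar pair
    print(pair_loss(no_planet_b, have_a_spritz, 0.0))    # dissimilar pair
```

Because both sentences go through the same encoder weights, minimising this loss pushes the pooled vectors themselves to become comparable with cosine similarity.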

No training data available
The dataset provided by JRC (GS1) is composed of only a few examples of BREF
passages and related patents.
Unfortunately, we could not rely on it to train our Siamese network!
Our solution was to:
● Fine-tune the net on two widely used STS datasets in general English (STSb & NLI);
● Fine-tune it on domain-specific datasets (TREC-Chem & NTCIR), which contain pairs of patents manually labeled (1-3, 1-5) according to their similarity;
● Test the model on GS1.
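
Since the domain-specific datasets use different label scales (1-3 and 1-5), one plausible preprocessing step is to map every label onto a common [0, 1] range before fine-tuning. The sketch below is an assumption for illustration; the slides do not say how the labels were actually harmonised, and the function name `rescale_label` is hypothetical.

```python
def rescale_label(score: float, low: float, high: float) -> float:
    """Linearly map a similarity label from [low, high] to [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


if __name__ == "__main__":
    # A label of 2 on a 1-3 scale and a label of 3 on a 1-5 scale
    # both land in the middle of the common range.
    print(rescale_label(2, 1, 3))
    print(rescale_label(3, 1, 5))
```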

BERT for Patents
[Figure: BERT-Large trained on 100M+ patents → BERT for Patents]

Motivation
● Huge availability of data
○ Millions of patents issued every year in the world
● Word semantics is strongly context-specific
○ CPC B41J 2/165 (Nozzles for printing mechanisms)
“priming” is a synonym of: “cleaning”, “maintenance”, “recovery”
○ CPC C23G (Cleaning of metallic material)
“priming” is a synonym of: “anchoring”, “bonding”, “subbing”

Differences with BERT-Large
● Special tokens identifying a specific section of the patent
[ABSTRACT], [CLAIM], [SUMMARY], [INVENTION]
○ 0.5% improvement in MLM performance
● 8000 additional words compared to the standard BERT vocabulary
○ Highly technical terms that BERT’s tokenizer would split into several subwords

Facebook AI Similarity Search (FAISS)

FAISS: workflow
● Index construction phase: patents corpus → SentenceBERT model → store vectors → FAISS index
● Inference phase: BREF passage (query) → SentenceBERT model → find nearest neighbor(s) in the FAISS index → closest match(es)
FAISS
FAISS is a library for efficient similarity search and clustering of dense vectors
● Implements algorithms for searching in sets of vectors (a.k.a. indices) of any
size, up to those that do not fit in RAM
○ Sharding, on-disk indices with memory mapping
● GPU optimized (CUDA)
● Accuracy/time tradeoff with exact/approximate search
● Memory saving with dimensionality reduction techniques (PCA)
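
The two phases of the workflow (store vectors, then find nearest neighbours) can be sketched with a tiny exact search in pure Python. This is a stand-in for illustration only and does not use the FAISS API: the class `FlatIndex` is hypothetical, and the 2-dimensional vectors are made up in place of real sentence embeddings. It mirrors what an exact inner-product index does on normalised vectors, which FAISS offers as `IndexFlatIP`.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class FlatIndex:
    """Exact (brute-force) inner-product search over unit-length vectors."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        # Index construction phase: store one normalised vector per patent.
        self.vectors.extend(normalize(v) for v in vectors)

    def search(self, query, k):
        # Inference phase: score the query against every stored vector and
        # return the k best (score, patent_id) pairs, best first.
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), idx)
            for idx, vec in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:k]


if __name__ == "__main__":
    index = FlatIndex()
    index.add([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy patent embeddings
    for score, patent_id in index.search([0.9, 0.1], k=2):
        print(patent_id, round(score, 3))
```

Exact search like this scales linearly with the corpus size, which is the cost that approximate indices trade against accuracy.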

FAISS: GPU performance
● The authors of FAISS claim a 5x - 10x speedup on a single GPU compared to
the corresponding CPU implementation
● If multiple GPUs are available, near-linear speedup over a single GPU can be
expected (6x - 7x with 8 GPUs)
● Experiments with a single GeForce GTX 1050 Ti Max-Q GPU:

Contributions
● Gathered several third-party datasets
● Solved the linguistic style mismatch using
fine-tuning ideas
● Analyzed multiple evaluation metrics
○ Spearman rank correlation, NDCG
● Enriched the Gold Standard dataset (GS1) by
submitting our model’s predictions to human
annotators
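
The two evaluation metrics named above can be sketched as follows. NDCG has several variants; this one uses linear gain with a log2 discount, and Spearman is computed as the Pearson correlation of average ranks so that ties are handled. Which exact variants were used in the project is not stated in the slides, and the toy scores are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order, k=None):
    """NDCG@k for graded relevance labels listed in the model's ranking order."""
    rels = list(relevances_in_predicted_order)
    if k is None:
        k = len(rels)
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (assumes neither input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    print(ndcg([3, 1, 2], k=3))                           # ranking quality
    print(spearman([0.9, 0.4, 0.7], [5, 1, 3]))           # score vs. label agreement
```

NDCG rewards placing the most relevant patents near the top of the list, while Spearman checks whether predicted similarity scores order the pairs the same way the human labels do.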
[Diagram: DualTransformer. A query model encodes the query and a response model encodes the response; a similarity evaluator (non-Euclidean) turns the two outputs into a similarity score.]

Results
● Our final models largely outperform baseline approaches
● For all metrics, higher is better

Deployment and open source release
● JRC will start using our retrieval engine now
○ Proud to say that we exceeded their
expectations for this project
● Our software will be released soon on GitHub
under the GNU GPL-3.0 license
○ JRC’s GitHub repository: github.com/ec-jrc

Francesco Cariaggi (@FCariaggi)
Cristiano De Nobili, PhD (@denocris)
Sébastien Bratières (@Seb_Bratieres)
Thank you for your attention.
Leveraging NLP to achieve
environmental sustainability
