Pi Campus invests in applied AI startups
49 investments in 5 years
The deal: € 50-500K for 1-10% share
It is a seed stage venture fund and a
startup district
50% Italy
25% Europe
25% California
Pi Campus
Several times a year, we host a batch of the best engineers from all
over the world to turn them into AI specialists.
They apply their new skills to an industry project provided either by
their own employer, or by world leading tech companies such as
Google, Facebook and Amazon and fast-growing startups.
School of Artificial Intelligence
● Merit First
Top developers get in for free, and those who transfer from abroad
will receive a travel and accommodation grant.
● Learn by doing
Minimal teaching. Desks and environment are organised to
support small project teams, agile co-development, interactions
with mentor.
● Real world projects, no simulations
Our partners sponsor top developers to solve real challenges.
School of Artificial Intelligence
Managing Director Faculty Advisors
Some mentors from our pool
Qualified advisors for your project
Director of AI, Facebook; Principal Applied Scientist, Amazon
Challenge portfolio
● Amazon: The Watercolour World
● Amazon: speech science with MXNet
● Translated: spoken language identification
● Translated: content-based translator scoring
● Wanderio: e-commerce fraud detection
● Lamco: mapping with AI
● Vatican Secret Archives: Latin OCR
● Kingcom: Food influencers
● Defenx: ransomware preemptive detection
● PwC: visual document classification
● PwC: interpretable machine learning
● Amex: loyalty programme email campaigns
● MiBACT, Italian ministry for heritage: searching Italy's art
● Atomikad: in-image advertising
● Veneto region: understanding hospital files
● Soldo: info extraction from receipts
● Covisian: call centre performance
● Xriba: tax codes on invoices
● Translated: customer lifetime value prediction
● Translated: adword optimisation
● Cisco: the ML platform for networking
● BNL: Basel II operational risk prediction
● BNL: ID scanning
● Poste: visual walk-in customer profiling
● Cisco: analysing hierarchical network data
● Cisco: reinforcement learning for wifi channel selection
● Cisco: privacy challenges in ML
● Cisco: combating Twitterbots
● Cisco: reconstructing depth from 2D images
● Engie: electricity imbalance prediction
● Cloudcare: lightspeed chat suggestions
● Employerland: AI for job coaching
● Sorgenia: customer care chatbot
● Inreach: sourcing investment opportunities
● Enel: AWS cloud optimization
● Enel X Colombia: propensity modelling
● Enel: electricity load forecast for homes
● Enel Distribution: medium voltage line SCADA fault prediction
● Enel Distribution: extreme weather impact on power lines
● Enel: IT helpdesk ticket routing
● Consiglio Nazionale del Notariato: AI for tomorrow's notary public
● Cloudcare: virtual call center supervisor
● Global and Local: improve rural municipalities' access to funding
● FASI: personalised tender alerts
● European Space Agency: earth observation
● Pryiatech: heart beat detection on video
● Freeda Media: lipstick recommender
● Radio Dimensione Suono: news picker
● Octo: fuel tank monitor
Next session: 29 November 2021
Register on Pi School’s website (School of AI / Apply now):
https://picampus-school.com/programme/school-of-ai/
Leveraging NLP to achieve environmental sustainability
In collaboration with the Joint Research Centre (JRC) of the European Union
#pischool
Francesco Cariaggi, @fcariaggi
Cristiano De Nobili, PhD, @denocris
Sébastien Bratières, @Seb_Bratieres
The larger issue
● EU must stay competitive in environmental capabilities
● JRC tasked with informing policy by analysis
● JRC Circular Economy and Industrial Leadership Unit
○ compile BREF: e.g. paper, slaughterhouses, ceramics, waste water, iron and steel production
○ data-driven tools from Economic Complexity discipline
● BUT: goods classifications are made for customs, not environmental
assessment!
Our solution: Geographically map capabilities, represented by patents, with
BREF as queries.
What do NLP and Env. Sust. have in common?
We then built an Information Retrieval (IR) system based on Transformers that can
retrieve R&D relevant patents...
Ok, but what kind of patents?
At this stage of the project,
JRC was interested in Industrial Pollution (Patents4IPPC).
AI for Sustainability
As AI scientists or engineers, we can make a difference to the world.
With this project at Pi School we are doing our bit.
“If you have to stand in front of a computer for hours,
make sure that there is a strong mission behind the screen.”
Scope of the project
Given a BREF* passage, find the most relevant patents
Most relevant patents
- Patent 1
- Patent 2
- ...
*BREFs (Best Available Techniques Reference documents) are the result of a long, detailed and SOTA technical analysis
of the available techniques (consolidated and emergent) in the field of industrial pollution control.
Technical Challenges
The project might seem a simple Text Similarity task with BERT,
but this is not the case:
● Linguistic style mismatch between query (BREF) and response (patent);
● From Contextualized Word Embedding to Sentence Embedding for Semantic
Textual Similarity (STS);
● No labeled training data available (GS1* as a test set);
● Huge response database (about 10-20 M patents).
*GS1 (Gold Standard 1) is a dataset composed of a few pairs of BREF passages and corresponding relevant patents.
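To give a feel for the last challenge, here is a back-of-envelope estimate of the raw storage needed for one dense vector per patent. The vector dimension (1024) and the float32 storage are assumptions made for illustration; the slides do not state the embedding size that was actually used.

```python
# Back-of-envelope memory estimate for storing one dense vector per patent.
# Assumptions (not from the slides): float32 vectors of dimension 1024.

BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors: int, dim: int = 1024) -> float:
    """Raw size in gigabytes (10**9 bytes) of an uncompressed float32 vector store."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


for n in (10_000_000, 20_000_000):
    print(f"{n:>11,d} patents -> {index_size_gb(n):.2f} GB")
```

Under these assumptions the vectors alone take tens of gigabytes, which is why the search structure matters as much as the model.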
Linguistic Style Mismatch
BREF Passage (less technical):
Reduction of the amount of oxygen available in the combustion zone to the minimum amount needed for complete combustion and
for minimising NOX generation. The technique is mainly based on the minimisation of air leakages in the furnace, careful control of
the air used for combustion and a modified design of the furnace combustion chamber.
Patent Abstract (technical, some words are omitted, redundant):
Burner assembly and method for combustion of gaseous of liquid fuel. The invention relates to a burner assembly (1) and a method for
combustion of gaseous or liquid fuel to heat an industrial furnace (9) having a combustion chamber (2), at least one main combustion air
inlet (3) for the supply of preheated combustion air (4) into the combustion chamber (2), a burner (5) with at least one fuel feed (7) and at
least one air feed (8) for supply of fuel and primary air into a the combustion chamber (2), wherein the burner (5) is positioned adjacent to a
combustion zone of the combustion chamber (2) such that the combustion air (4) flowing into the combustion chamber (2) through the
main combustion air inlet (3) is passing the burner (5) in the combustion zone and is then deflected such that the flow of preheated
combustion air and the smaller flows of fuel and primary air are flowing mainly in parallel from the burner (5) to the furnace (9), and a control
unit for controlling the supply of fuel and maybe primary air into the combustion chamber (2). The control unit is adapted to supply the fuel
and/or the primary air from the fuel and/or air feed (7, 8) into the combustion chamber (2) with an exit velocity higher than 150 m/s.
Measuring Linguistic Style Mismatch
Loss Function: 0.97 1.0 1.67 2.29 1.10
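The slide does not say how these loss values were computed. Assuming they are masked-language-model losses, a minimal sketch of such a measurement is the mean negative log-likelihood of the true tokens at the masked positions: the less familiar the writing style, the higher the loss. The probabilities below are invented for illustration and are not the project's data.

```python
import math


def mlm_loss(probs_of_true_tokens):
    """Mean negative log-likelihood of the true tokens at the masked positions."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)


# Hypothetical probabilities a model assigns to the correct token at each masked position.
familiar_style = [0.60, 0.45, 0.30, 0.50]    # text close to the training domain
unfamiliar_style = [0.20, 0.05, 0.10, 0.15]  # text in a style the model rarely saw

print(f"familiar:   {mlm_loss(familiar_style):.2f}")
print(f"unfamiliar: {mlm_loss(unfamiliar_style):.2f}")
```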
Solving Linguistic Style Mismatch
[Figure: loss function values (0.97, 1.0, 1.67, 2.29, 1.10) for bert-4-patents, comparing the original checkpoint by Google (patents) with our solution, adaptively tuned with a Masked Language Model objective on BREF docs & Patstat (BREFs jargon, EU patents)]
Geometry of BERT
From Contextualized Word Embedding to Sentence Embedding for STS
[Figure: the sentence “No planet B” embedded by BERT and by a non-contextual LM (W2V, GloVe), with one vector per token: No, planet, B]
BERT embeddings do not live in a Euclidean space but in something more similar to a hyperbolic space. Here we cannot sum vectors or use cosine similarity to measure their distance. BERT is not a good sentence embedder.
Hewitt & Manning 2019, arXiv:1906.02715, arXiv:1909.00512.
This is related to the fact that MLM training is not optimized to treat all embedding dimensions equally.
Hewitt & Manning 2019, arXiv:1906.02715, arXiv:1909.00512.
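For reference, the naive recipe these slides argue against is to average the token vectors and compare the results with cosine similarity. The sketch below uses made-up 3-dimensional vectors (not real BERT outputs) and shows one simple weakness of pooling by averaging: two different sets of token vectors can collapse to exactly the same sentence vector.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional token embeddings for two short sentences.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]

vec_a = mean_pool(sentence_a)  # [2.0, 1.0, 1.0]
vec_b = mean_pool(sentence_b)  # [2.0, 1.0, 1.0]
print(cosine_similarity(vec_a, vec_b))
```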
Sentence BERT
A siamese network, when trained in a supervised way, is able to generate meaningful sentence embeddings
[Figure: two BERT encoders in a siamese setup; “No planet B” is compared with “Save the planet” and “Let’s have a spritz”, with labels [1, 0, …]]
SentBERT: http://arxiv.org/abs/1908.10084
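The siamese idea can be sketched in a few lines: the same encoder, with the same weights, embeds both sentences, and the loss compares the cosine similarity of the two embeddings with a human label. Everything below is a toy stand-in chosen for illustration: the encoder is a bag-of-words projection rather than BERT, the weights are arbitrary placeholders, and the gold labels are hypothetical.

```python
import math

VOCAB = ["no", "planet", "b", "save", "the", "let's", "have", "a", "spritz"]

# One weight matrix shared by both branches: this sharing is what makes the network siamese.
# The values are arbitrary placeholders, not trained weights.
SHARED_WEIGHTS = [[((3 * i + 5 * j) % 7) / 7.0 for j in range(len(VOCAB))] for i in range(4)]


def encode(sentence):
    """Toy encoder: bag-of-words counts projected by the shared weight matrix."""
    words = sentence.lower().split()
    counts = [words.count(w) for w in VOCAB]
    return [sum(w * c for w, c in zip(row, counts)) for row in SHARED_WEIGHTS]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pair_loss(sentence_1, sentence_2, gold_similarity):
    """Squared error between the predicted cosine similarity and the human label."""
    predicted = cosine(encode(sentence_1), encode(sentence_2))
    return (predicted - gold_similarity) ** 2


print(pair_loss("No planet B", "Save the planet", 1.0))
print(pair_loss("No planet B", "Let's have a spritz", 0.0))
```

Supervised training would adjust the shared weights to reduce this loss over many labeled pairs, which is why labeled data is needed.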
No training data available
The dataset that was provided by JRC (GS1) is composed of a few examples of BREF
passages and related patents.
Unfortunately, we could not rely on it to train our Siamese Network!
Our solution was a three-step pipeline:
● Fine-tune the network on two widely used STS datasets in general English (STSb & NLI).
● Fine-tune it on domain-specific datasets (TREC-Chem & NTCIR). They contain pairs of patents manually labeled (1-3, 1-5) according to their similarity.
● Test the model on GS1.
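Because the domain-specific labels come on different scales (1-3 and 1-5), they need a common range before they can serve as similarity targets. The slides do not say how this was handled; one simple option, shown here as an assumption, is a linear rescaling to [0, 1]. The function name and the example pairs are hypothetical.

```python
def normalize_label(score, low, high):
    """Linearly rescale a similarity label from [low, high] to [0, 1]."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    return (score - low) / (high - low)


# Hypothetical labeled pairs: (dataset scale, raw label).
examples = [((1, 3), 1), ((1, 3), 3), ((1, 5), 3), ((1, 5), 5)]
for (low, high), raw in examples:
    print(f"scale {low}-{high}, label {raw} -> {normalize_label(raw, low, high):.2f}")
```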
BERT for Patents
[Figure: BERT-Large trained on 100M+ patents gives BERT for Patents]
Motivation
● Huge availability of data
○ Millions of patents issued every year in the world
● Word semantics is strongly context-specific
○ CPC B41J 2/165 (Nozzles for printing mechanisms)
“priming” is a synonym of: “cleaning”, “maintenance”, “recovery”
○ CPC C23G (Cleaning of metallic material)
“priming” is a synonym of: “anchoring”, “bonding”, “subbing”
Differences with BERT-Large
● Special tokens identifying a specific section of the patent
[ABSTRACT], [CLAIM], [SUMMARY], [INVENTION]
○ 0.5% improvement in MLM performance
● 8000 additional words compared to the standard BERT vocabulary
○ Highly technical terms that BERT’s tokenizer would split into several subwords
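The effect of the extra vocabulary can be illustrated with a small greedy longest-match-first subword splitter in the style of WordPiece. The two vocabularies below are hypothetical and tiny; they are not the real BERT or BERT for Patents word lists.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabularies, not the real BERT or BERT for Patents word lists.
general_vocab = {"de", "##nit", "##ri", "##fication"}
patent_vocab = general_vocab | {"denitrification"}

print(wordpiece("denitrification", general_vocab))  # ['de', '##nit', '##ri', '##fication']
print(wordpiece("denitrification", patent_vocab))   # ['denitrification']
```

With the technical term in the vocabulary, the model sees one token carrying one meaning instead of four fragments.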
Facebook AI Similarity Search (FAISS)
FAISS: workflow
● Index construction phase: patents corpus → SentenceBERT model → store vectors → FAISS index
● Inference phase: BREF passage (query) → SentenceBERT model → find nearest neighbor(s) in the FAISS index → closest match(es)
FAISS
FAISS is a library for efficient similarity search and clustering of dense vectors
● Implements algorithms for searching in sets of vectors (a.k.a. indices) of any
size, up to those that do not fit in RAM
○ Sharding, on-disk indices with memory mapping
● GPU optimized (CUDA)
● Accuracy/time tradeoff with exact/approximate search
● Memory saving with dimensionality reduction techniques (PCA)
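To make the exact end of the accuracy/time tradeoff concrete, here is what an exact search does conceptually: score every stored vector against the query and keep the best k. This is plain Python written for illustration, not the FAISS API; the function name and the vectors are hypothetical.

```python
import heapq


def exact_search(query, corpus, k=2):
    """Score every stored vector against the query and return the k best (score, id) pairs."""
    scored = ((sum(q * x for q, x in zip(query, vec)), idx) for idx, vec in enumerate(corpus))
    return heapq.nlargest(k, scored)


# Hypothetical unit-length "patent" vectors; ids are positions in the list.
patents = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
bref_query = [0.8, 0.6, 0.0]

for score, patent_id in exact_search(bref_query, patents):
    print(patent_id, round(score, 2))
```

Each query costs one pass over the whole corpus, so with millions of patents the approximate indices and GPU support listed above become attractive.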
FAISS: GPU performance
● The authors of FAISS claim a 5x - 10x speedup on a single GPU compared to
the corresponding CPU implementation
● If multiple GPUs are available, near-linear speedup over a single GPU can be
expected (6x - 7x with 8 GPUs)
● Experiments with a single GeForce GTX 1050 Ti Max-Q GPU:
Contributions
● Gathered several third-party datasets
● Solved the linguistic style mismatch using
fine-tuning ideas
● Analyzed multiple evaluation metrics
○ Spearman rank correlation, NDCG
● Enriched the Gold Standard dataset (GS1) by
submitting our model’s predictions to human
annotators
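For readers unfamiliar with the two metrics named above, here is a compact reference sketch. The NDCG variant uses linear gain, which is an assumption (an exponential gain is also common), and the example numbers are invented, not project results.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors (non-constant inputs)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def ndcg(relevances, k):
    """NDCG@k with linear gain: DCG of the given order divided by DCG of the ideal order."""
    def dcg(rels):
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical data: model scores vs. human labels, and relevance labels in ranked order.
print(spearman([0.9, 0.4, 0.7, 0.1], [5, 2, 3, 1]))
print(ndcg([3, 1, 2, 0], k=4))
```

Spearman compares the model's similarity scores with human labels over a set of pairs, while NDCG judges the order of a ranked result list.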
DualTransformer
[Figure: a query model encodes the query and a response model encodes the response; a (non-Euclidean) similarity evaluator maps the pair to a similarity score]
Results
● Our final models largely outperform baseline approaches
● For all metrics, higher is better
Deployment and open source release
● JRC will start using our retrieval engine now
○ Proud to say that we exceeded their
expectations for this project
● Our software will be released soon on GitHub
under the GNU GPL-3.0 license
○ JRC’s GitHub repository: github.com/ec-jrc
Francesco Cariaggi (@FCariaggi)
Cristiano De Nobili, PhD (@denocris)
Sébastien Bratières (@Seb_Bratieres)
Thank you for your attention.
Leveraging NLP to achieve
environmental sustainability

DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 

Pi school-dli-presentation de nobili

  • 1. Pi Campus invests in applied AI startups. 49 investments in 5 years. The deal: €50-500K for a 1-10% share. It is a seed-stage venture fund and a startup district. 50% Italy, 25% Europe, 25% California. Pi Campus
  • 2. Several times a year, we host a batch of the best engineers from all over the world to turn them into AI specialists. They apply their new skills on the industry project provided either by their own employer, or by world leading tech companies such as Google, Facebook and Amazon and fast-growing startups. School of Artificial Intelligence
  • 3. ● Merit First Top developers get in for free, and those who transfer from abroad will receive a travel and accommodation grant. ● Learn by doing Minimal teaching. Desks and environment are organised to support small project teams, agile co-development, interactions with mentor. ● Real world projects, no simulations Our partners sponsor top developers to solve real challenges. School of Artificial Intelligence
  • 4. Managing Director, Faculty, Advisors. Some mentors from our pool: qualified advisors for your project. Director of AI, Facebook; Principal Applied Scientist, Amazon
  • 5. Challenge portfolio ● Amazon: The Watercolour World ● Amazon: speech science with MXNet ● Translated: spoken language identification ● Translated: content-based translator scoring ● Wanderio: e-commerce fraud detection ● Lamco: mapping with AI ● Vatican Secret Archives: Latin OCR ● Kingcom: Food influencers ● Defenx: ransomware preemptive detection ● PwC: visual document classification ● PwC: interpretable machine learning ● Amex: loyalty programme email campaigns ● MiBACT, Italian ministry for heritage: searching Italy's art ● Atomikad: in-image advertising ● Veneto region: understanding hospital files ● Soldo: info extraction from receipts ● Covisian: call centre performance ● Xriba: tax codes on invoices ● Translated: customer lifetime value prediction ● Translated: adword optimisation ● Cisco: the ML platform for networking ● BNL: Basel II operational risk prediction ● BNL: ID scanning ● Poste: visual walk-in customer profiling ● Cisco: analysing hierarchical network data ● Cisco: reinforcement learning for wifi channel selection ● Cisco: privacy challenges in ML ● Cisco: combating Twitterbots ● Cisco: reconstructing depth from 2D images ● Engie: electricity imbalance prediction ● Cloudcare: lightspeed chat suggestions ● Employerland: AI for job coaching ● Sorgenia: customer care chatbot ● Inreach: sourcing investment opportunities ● Enel: AWS cloud optimization ● Enel X Colombia: propensity modelling ● Enel: electricity load forecast for homes ● Enel Distribution: medium voltage line SCADA fault prediction ● Enel Distribution: extreme weather impact on power lines ● Enel: IT helpdesk ticket routing ● Consiglio Nazionale del Notariato: AI for tomorrow's notary public ● Cloudcare: virtual call center supervisor ● Global and Local: improve rural municipalities' access to funding ● FASI: personalised tender alerts ● European Space Agency: earth observation ● Pryiatech: heart beat detection on video ● Freeda Media: lipstick recommender ● Radio Dimensione Suono: news picker ● Octo: fuel tank monitor
  • 6. Next session: 29 November 2021. Register on Pi School’s website: School of AI / Apply now. https://picampus-school.com/programme/school-of-ai/
  • 7. Leveraging NLP to achieve environmental sustainability In collaboration with the Joint Research Centre (JRC) of the European Union #pischool Francesco Cariaggi, @fcariaggi Cristiano De Nobili, PhD, @denocris Sébastien Bratières, @Seb_Bratieres
  • 8. The larger issue ● EU must stay competitive in environmental capabilities ● JRC tasked with informing policy by analysis ● JRC Circular Economy and Industrial Leadership Unit ○ compile BREF: e.g. paper, slaughterhouses, ceramics, waste water, iron and steel production ○ data-driven tools from Economic Complexity discipline ● BUT: goods classifications are made for customs, not environmental assessment! Our solution: Geographically map capabilities, represented by patents, with BREF as queries.
  • 9. What do NLP and Env. Sust. have in common? We then built an Information Retrieval (IR) system based on Transformers that can retrieve R&D relevant patents... Ok, but what kind of patents? At this stage of the project, JRC was interested in Industrial Pollution (Patents4IPPC).
  • 10. AI for Sustainability As AI scientists or engineers, we can make a difference to the world. With this project at Pi School we are doing our bit. “If you have to stand in front of a computer for hours, make sure that there is a strong mission behind the screen.”
  • 11. Scope of the project: given a BREF* passage, find the most relevant patents (Patent 1, Patent 2, ...). *BREFs (Best Available Techniques Reference documents) are the result of a long, detailed, state-of-the-art technical analysis of the available techniques (consolidated and emergent) in the field of industrial pollution control.
  • 12. Technical Challenges The project might seem a simple Text Similarity task with BERT, but this is not the case: ● Linguistic style mismatch between query (BREF) and response (patent); ● From Contextualized Word Embedding to Sentence Embedding for Semantic Textual Similarity (STS); ● No labeled training data available (GS1* used as a test set); ● Huge response database (about 10-20 M patents). *GS1 (Gold Standard 1) is a dataset composed of a few pairs of a BREF passage and a corresponding relevant patent.
  • 13. Linguistic Style Mismatch. BREF Passage (less technical): Reduction of the amount of oxygen available in the combustion zone to the minimum amount needed for complete combustion and for minimising NOX generation. The technique is mainly based on the minimisation of air leakages in the furnace, careful control of the air used for combustion and a modified design of the furnace combustion chamber. Patent Abstract (technical, some words are omitted, redundant): Burner assembly and method for combustion of gaseous or liquid fuel. The invention relates to a burner assembly (1) and a method for combustion of gaseous or liquid fuel to heat an industrial furnace (9) having a combustion chamber (2), at least one main combustion air inlet (3) for the supply of preheated combustion air (4) into the combustion chamber (2), a burner (5) with at least one fuel feed (7) and at least one air feed (8) for supply of fuel and primary air into the combustion chamber (2), wherein the burner (5) is positioned adjacent to a combustion zone of the combustion chamber (2) such that the combustion air (4) flowing into the combustion chamber (2) through the main combustion air inlet (3) is passing the burner (5) in the combustion zone and is then deflected such that the flow of preheated combustion air and the smaller flows of fuel and primary air are flowing mainly in parallel from the burner (5) to the furnace (9), and a control unit for controlling the supply of fuel and maybe primary air into the combustion chamber (2). The control unit is adapted to supply the fuel and/or the primary air from the fuel and/or air feed (7, 8) into the combustion chamber (2) with an exit velocity higher than 150 m/s.
  • 14. Measuring Linguistic Style Mismatch (figure: loss function values 0.97, 1.0, 1.67, 2.29, 1.10)
  • 15. Solving Linguistic Style Mismatch. Our solution: bert-4-patents, the original checkpoint by Google, adaptively tuned using a Masked Language Model objective on BREF docs & Patstat. (figure: patents, BREFs, jargon, EU patents; loss function values 0.97, 1.0, 1.67, 2.29, 1.10)
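A minimal sketch of how Masked Language Model training examples can be built from in-domain text, as background for the adaptive tuning mentioned on this slide. The function name `mask_tokens`, the 15% masking rate, and the whitespace tokenization are assumptions for illustration; the slides do not state the settings used in the project, and BERT's original recipe additionally replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified MLM objective.

    Each token is independently replaced by [MASK] with probability
    mask_prob. labels holds the original token at masked positions and
    None elsewhere, so the loss is computed only on masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# Hypothetical in-domain sentence (BREF-like wording).
sentence = "careful control of the air used for combustion".split()
inputs, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

Continuing pre-training on such examples drawn from BREF and patent text is what lets the model adapt to both linguistic styles without any labeled pairs.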
  • 16. Geometry of BERT: from Contextualized Word Embedding to Sentence Embedding for STS. BERT embeddings do not live in a Euclidean space but in something more similar to a hyperbolic space. Here we cannot sum vectors or use cosine similarity to measure their distance. BERT is not a good sentence embedder. (figure: “No planet B” embedded by BERT versus a non-contextual LM such as W2V or GloVe) Hewitt & Manning 2019, arXiv:1906.02715, arXiv:1909.00512.
  • 17. Geometry of BERT: from Contextualized Word Embedding to Sentence Embedding for STS. The geometry of BERT embeddings is related to the fact that MLM training is not optimized to treat all embedding dimensions equally. (figure: “No planet B” embedded by BERT versus a non-contextual LM such as W2V or GloVe) Hewitt & Manning 2019, arXiv:1906.02715, arXiv:1909.00512.
  • 18. Sentence BERT A siamese network, when trained in a supervised way, is able to generate meaningful sentence embeddings BERT “No planet B” BERT “Save the planet” “Let’s have a spritz” [1, 0, …] SentBERT: http://arxiv.org/abs/1908.10084
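The siamese idea can be sketched in a few lines: the same encoder is applied to both sentences, token vectors are pooled into one sentence vector, and the two sentence vectors are compared. This is a toy illustration only. `TOY_EMBEDDINGS` is a hypothetical lookup table standing in for a trained BERT encoder, and mean pooling with cosine similarity follows the Sentence-BERT paper rather than anything confirmed about this project's final similarity function.

```python
import math

# Hypothetical 3-dimensional token vectors standing in for a trained encoder.
TOY_EMBEDDINGS = {
    "no": [0.2, 0.8, 0.0], "planet": [0.9, 0.3, 0.0], "b": [0.1, 0.7, 0.1],
    "save": [0.8, 0.4, 0.0], "the": [0.3, 0.3, 0.1],
    "let's": [0.0, 0.1, 0.8], "have": [0.1, 0.0, 0.7],
    "a": [0.1, 0.1, 0.3], "spritz": [0.0, 0.0, 1.0],
}

def mean_pool(token_vectors):
    """Average token vectors into a single fixed-size sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def encode(sentence):
    """Shared encoder: both branches of the siamese network call this."""
    tokens = sentence.lower().split()
    return mean_pool([TOY_EMBEDDINGS[t] for t in tokens])

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def siamese_similarity(sentence_a, sentence_b):
    return cosine(encode(sentence_a), encode(sentence_b))

related = siamese_similarity("No planet B", "Save the planet")
unrelated = siamese_similarity("No planet B", "Let's have a spritz")
```

With these toy vectors the related pair scores higher than the unrelated one; in the real setting that behaviour has to be learned by supervised training of the shared encoder.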
  • 19. No training data available. The dataset provided by JRC (GS1) is composed of a few examples of BREF passages and related patents. Unfortunately, we could not rely on it to train our Siamese Network! Our solution was therefore to: (1) fine-tune the network on two widely used general-English STS datasets (STSb & NLI); (2) fine-tune it on domain-specific datasets (TREC-Chem & NTCIR), which contain pairs of patents manually labeled (1-3, 1-5) according to their similarity; (3) test the model on GS1.
  • 21. BERT for Patents (figure: 100M+ patents, BERT-Large, BERT for Patents)
  • 22. Motivation ● Huge availability of data ○ Millions of patents issued every year in the world ● Word semantics is strongly context-specific ○ CPC B41J 2/165 (Nozzles for printing mechanisms) “priming” is a synonym of: “cleaning”, “maintenance”, “recovery” ○ CPC C23G (Cleaning of metallic material) “priming” is a synonym of: “anchoring”, “bonding”, “subbing”
  • 23. Differences with BERT-Large ● Special tokens identifying a specific section of the patent: [ABSTRACT], [CLAIM], [SUMMARY], [INVENTION] ○ 0.5% improvement in MLM performance ● 8000 additional words compared to the standard BERT vocabulary ○ Highly technical terms that BERT’s tokenizer would split into several subwords
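To illustrate why a larger vocabulary matters, here is a minimal sketch of greedy longest-match-first subword tokenization in the style of WordPiece. The vocabularies below are hypothetical toy sets, not the actual BERT or BERT for Patents vocabulary, and `wordpiece` is a simplified stand-in for a real tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup.

    Continuation pieces carry the '##' prefix, as in BERT's WordPiece.
    If some part of the word cannot be matched, the whole word maps to unk.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical general-purpose vocabulary: the technical term is missing.
base_vocab = {"pre", "##heat", "##ed", "burner"}
# Hypothetical domain vocabulary: the technical term is a single entry.
patent_vocab = base_vocab | {"preheated"}

split_term = wordpiece("preheated", base_vocab)
whole_term = wordpiece("preheated", patent_vocab)
```

With the base vocabulary the term is broken into three pieces, while the extended vocabulary keeps it as one token, which is the effect the 8000 additional words are meant to achieve for technical terms.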
  • 24. Facebook AI Similarity Search (FAISS)
  • 25. Find nearest neighbor(s) Closest match(es) BREF passage (query) Inference phase Store vectors Patents corpus Index construction phase FAISS: workflow SentenceBERT model FAISS index
  • 26. FAISS FAISS is a library for efficient similarity search and clustering of dense vectors ● Implements algorithms for searching in sets of vectors (a.k.a. indices) of any size, up to those that do not fit in RAM ○ Sharding, on-disk indices with memory mapping ● GPU optimized (CUDA) ● Accuracy/time tradeoff with exact/approximate search ● Memory saving with dimensionality reduction techniques (PCA)
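The two phases of the workflow (index construction, then inference) can be sketched with a brute-force exact search over normalized vectors. This is a pure-Python stand-in for the idea, not the FAISS API: `FlatIndex` and its methods are hypothetical names, and the 2-dimensional vectors stand in for sentence embeddings of patents and of a BREF query.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

class FlatIndex:
    """Exact (exhaustive) nearest-neighbour search by inner product."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        # Index construction phase: store the normalized corpus vectors.
        self.vectors.extend(normalize(v) for v in vectors)

    def search(self, query, k):
        # Inference phase: score the query against every stored vector
        # and return the k best (score, position) pairs, best first.
        q = normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), position)
            for position, v in enumerate(self.vectors)
        )
        return heapq.nlargest(k, scored)

index = FlatIndex()
index.add([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy "patent" embeddings
hits = index.search([1.0, 0.1], k=2)             # toy "BREF passage" embedding
top_positions = [position for _, position in hits]
```

Exhaustive search scales linearly with the corpus size, which is why a library with approximate indices, sharding and GPU support becomes necessary at the scale of millions of patents.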
  • 27. FAISS: GPU performance ● The authors of FAISS claim a 5x - 10x speedup on a single GPU compared to the corresponding CPU implementation ● If multiple GPUs are available, near-linear speedup over a single GPU can be expected (6x - 7x with 8 GPUs) ● Experiments with a single GeForce GTX 1050 Ti Max-Q GPU:
  • 29. Contributions ● Gathered several third-party datasets ● Solved the linguistic style mismatch using fine-tuning ideas ● Analyzed multiple evaluation metrics ○ Spearman rank correlation, NDCG ● Enriched the Gold Standard dataset (GS1) by submitting our model’s predictions to human annotators (figure: DualTransformer with a query model and a response model; the query and the response are fed to a non-Euclidean similarity evaluator that outputs a similarity score)
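The two evaluation metrics named above can be sketched as follows. This is a minimal reference implementation under stated assumptions: NDCG uses the linear-gain variant (the graded relevance itself as gain), ties in the Spearman ranking receive average ranks, and the example relevance grades are made up for illustration. The slides do not specify which variants the project used.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical relevance grades of retrieved patents, in retrieved order.
perfect_ranking = ndcg([3, 2, 1], k=3)
reversed_ranking = ndcg([1, 2, 3], k=3)
```

Both metrics are higher-is-better: NDCG rewards placing the most relevant patents first, while Spearman checks whether predicted similarity scores order the pairs the same way as human labels.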
  • 30. Results ● Our final models largely outperform baseline approaches ● For all metrics, higher is better
  • 31. Deployment and open source release ● JRC will start using our retrieval engine now ○ Proud to say that we exceeded their expectations for this project ● Our software will be released soon on GitHub under the GNU GPL-3.0 license ○ JRC’s GitHub repository: github.com/ec-jrc
  • 32. Francesco Cariaggi (@FCariaggi) Cristiano De Nobili, PhD (@denocris) Sébastien Bratières (@Seb_Bratieres) Thank you for your attention. Leveraging NLP to achieve environmental sustainability