Scikit-learn for text mining at Jurismarchés.
Oussama Ahmia
Jurismarchés – IRISA
Outline:
1. Introduction.
2. CityZenMap.
3. Text categorization example.
4. Relation extraction.
Context
Jurismarchés is an SME founded in 2005 and based in Nantes, France.
- 14 employees: 1/2 IT team, 1/3 market experts and analysts.
Jurismarchés has two missions:
(1) helping the development of companies in the area of public
procurement by focusing on the provision of economic environment
knowledge and business;
(2) democratizing access to information on markets.
A database of more than 7,000,000 documents about markets. About 52,000
documents a year require manual content curation by a human expert, and
among them 13,000 a year lead to inherent difficulties.
CityZenMap:
A map that allows us to visualize and follow territory planning (new
buildings and constructions) in France.
E-citizenship: CityZenMap is an e-citizenship application that reinforces
the relationship between citizens and their elected representatives.
NLP / ML: it autonomously detects territory development projects from
public databases (calls for tenders, commercial authorizations) that
contain announcements from several areas.
CityZenMap:
The map
- Classifying the announcements.
- Categorizing the projects by type.
- Transforming and summarizing the titles of the projects.
- Geolocating the projects.
To retrieve the relevant projects, the app analyses public procurement
announcements. The process of finding and publishing the projects on the
map is done in three steps.
CityZenMap:
Some text mining concepts
Supervised learning:
The goal is to build, from labeled data, a function (estimator) F : X → Y
where:
X is the observation features.
Y is the predicted values.
What features for text documents?
We need to extract the features.
One of the easiest feature extraction methods is bag of words:
[[0 1 0 1 0]
[1 0 1 0 1]]
['are', 'pydata', 'rocking', 'rocks', 'you']
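The matrix and vocabulary above are consistent with scikit-learn's `CountVectorizer` applied to two short sentences. The sketch below reproduces them; the two input sentences are an assumption inferred from the vocabulary, since the slide's own code is not preserved.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed input: two sentences inferred from the vocabulary shown on the slide.
docs = ["pydata rocks", "you are rocking"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Vocabulary terms ordered by their column index in the matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
matrix = counts.toarray().tolist()

print(matrix)
print(vocabulary)
```

Each row is a document and each column counts one vocabulary term, so word order inside a sentence is lost.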
CityZenMap:
Some text mining concepts
Stemming:
A stem is the form to which affixes can be attached (the root of a word).
A stemming algorithm reduces the words "fishing", "fished", and "fisher" to
the root word "fish".
An example of stemming using Sklearn & NLTK:
[[0 1 1 0]
[1 0 1 1]]
['are', 'pydata', 'rock', 'you']
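One possible way to obtain this output is to plug an NLTK stemmer into the analyzer of `CountVectorizer`. This is a sketch under assumptions: the input sentences are inferred from the vocabulary, and the choice of `PorterStemmer` and the helper name `stemmed_analyzer` are illustrative, not taken from the slide.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Assumed input: two sentences inferred from the vocabulary shown on the slide.
docs = ["pydata rocks", "you are rocking"]

stemmer = PorterStemmer()
tokenize = CountVectorizer().build_analyzer()  # default lowercasing and tokenization


def stemmed_analyzer(doc):
    """Tokenize with scikit-learn's default analyzer, then stem each token."""
    return [stemmer.stem(token) for token in tokenize(doc)]


vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
stem_matrix = vectorizer.fit_transform(docs).toarray().tolist()
stem_vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(stem_matrix)
print(stem_vocabulary)
```

"rocks" and "rocking" now share the single column "rock", which shrinks the vocabulary from five terms to four.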
CityZenMap:
Random forest
Randomization
• Bootstrap samples
• Random selection of variables
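In scikit-learn, these two sources of randomization correspond to the `bootstrap` and `max_features` parameters of `RandomForestClassifier`. The sketch below uses synthetic data as a stand-in for vectorized announcements, and the hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the vectorized announcements (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees (illustrative value)
    bootstrap=True,       # each tree is fit on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the variables
    random_state=0,
)
forest.fit(X, y)

n_trees = len(forest.estimators_)
train_accuracy = forest.score(X, y)
print(n_trees, train_accuracy)
```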
CityZenMap:
The workflow
[Figure: workflow diagrams]
CityZenMap:
The results
We have 2200 projects (observations) labeled by experts, 12% positive
and 88% negative (stem-based RMF).
               precision  recall  f1-score  support
Non pertinent  0.941      0.969   0.955     88%
Pertinent      0.676      0.511   0.582     12%
avg / total    0.911      0.918   0.913     100%
The lower precision on the positive class is due to:
- a lack of examples
- some errors in the experts' labeling
The stem-based classifier is more efficient with long project titles:
- Restauration de l'Ozon : Maitrise d'oeuvre pour l'aménagement morphologique et
écologique des tronçons prioritaires Etude hydraulique et proposition de scénarios de
protection des zones.
Without stems:
- Construction de la Cité scolaire de Luzech
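As a quick consistency check on the results table, the f1-score column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The helper name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
reported = {
    "Non pertinent": (0.941, 0.969),
    "Pertinent": (0.676, 0.511),
}

f1_by_class = {
    label: round(f1_score(p, r), 3) for label, (p, r) in reported.items()
}
print(f1_by_class)
```

The recomputed values match the table (0.955 and 0.582), which shows how the low recall of the positive class pulls its F1 down.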
Text categorization example
Context
Public procurement announcements contain legal and administrative
information (noise).
Juridico-admin (binary classification): detect in the announcements whether
a sentence is legal and administrative content or a passage that gives
information about the project.
The utility
- Improve the results of the search engine.
- A useful tool for the experts at Jurismarchés.
CityZenMap:
Scikit-learn pipelines
Using Pipeline to implement a workflow
Saving the pipeline
Loading the pipeline
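The slide's code is not preserved, so the following is only a minimal sketch of the three steps. It assumes a TF-IDF vectorizer followed by logistic regression, persistence with `joblib`, and a tiny made-up corpus in which "oui" marks legal and administrative sentences; none of these choices are confirmed by the source.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (assumption): "oui" = legal/administrative content.
sentences = [
    "les offres doivent être remises avant la date limite",
    "le règlement de la consultation est disponible en ligne",
    "construction d'une salle de sport de 200 m²",
    "réhabilitation de 12 logements et aménagement des jardins",
]
labels = ["oui", "oui", "non", "non"]

# Using Pipeline to implement a workflow: vectorizer, then classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(sentences, labels)

# Saving the pipeline.
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, path)

# Loading the pipeline.
restored = joblib.load(path)
same_predictions = list(restored.predict(sentences)) == list(pipeline.predict(sentences))
print(same_predictions)
```

Persisting the whole pipeline keeps the fitted vocabulary and the classifier together, so raw text can be passed directly to the restored object.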
Text categorization example
The results
We have 1M sentences (observations), 20% for test and 80% for
training.
             precision  recall  f1-score  support
non          0.99       0.99    0.99      29659
oui          1.00       1.00    1.00      153906
avg / total  1.00       1.00    1.00      183565
20% are negative and 80% are positive.
In our case, reducing false positives matters most, so we ignore
positive observations whose classification probability is smaller
than a threshold.
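The thresholding rule can be sketched as follows. The threshold value of 0.9, the 0.5 decision boundary, and the function name `decide` are assumptions made for illustration; the source does not give the actual values.

```python
def decide(positive_probability, threshold=0.9, boundary=0.5):
    """Classify one observation, ignoring low-confidence positives."""
    if positive_probability >= threshold:
        return "positive"
    if positive_probability >= boundary:
        return "ignored"  # predicted positive, but not confidently enough
    return "negative"


# Hypothetical positive-class probabilities, e.g. one column of predict_proba.
probabilities = [0.99, 0.95, 0.72, 0.55, 0.10]
decisions = [decide(p) for p in probabilities]
print(decisions)
```

Raising the threshold trades recall for precision: fewer sentences are accepted as positive, but those that are accepted are less likely to be false positives.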
Relation Extraction:
Introduction
Context
- The modeling of French cities from the textual description of territory
development projects.
How?
- Detect the area of each building from the text.
- In other words, link each building with its respective area.
Relation extraction
Its aim is to detect (establish) relationships between named entities.
Relation Extraction:
Conditional Random Fields
Looks like an HMM, smells like linear regression
Relation Extraction:
CRFs (theory)
A CRF model consists of:
- F = <f1,...,fk>, a vector of feature functions
- λ = <λ1,...,λk>, a vector of weights
Let X = <x1,...,xt> be an observed sentence
Let Y = <y1,...,yt> be the labels
Conditional distribution:
Normalization:
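For reference, the usual linear-chain form of these two quantities is written out below with the symbols defined on this slide. It assumes each feature function takes the previous label, the current label, the sentence, and the position as arguments; the slide's exact notation is not preserved.

```latex
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\left( \sum_{i=1}^{t} \sum_{j=1}^{k} \lambda_j \, f_j(y_{i-1}, y_i, X, i) \right)

Z(X) = \sum_{Y'}
  \exp\left( \sum_{i=1}^{t} \sum_{j=1}^{k} \lambda_j \, f_j(y'_{i-1}, y'_i, X, i) \right)
```

Z(X) sums over every possible label sequence Y' of length t, so the probabilities of all sequences for a given sentence add up to 1.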
Relation Extraction:
CRFs (example)
Feature functions
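The slide's own example is not preserved. As a stand-in, here is a toy linear-chain CRF computed by brute force: three hypothetical feature functions with hand-picked (not learned) weights, a score that sums weighted features over positions, and a normalization over all label sequences. Everything named here is an illustrative assumption.

```python
import itertools
import math

LABELS = ["0", "B-S", "I-S", "E-S"]


# Hypothetical feature functions f_j(y_prev, y_cur, X, i), each returning 0 or 1.
def f_building_begins(y_prev, y_cur, X, i):
    return 1.0 if X[i] == "salle" and y_cur == "B-S" else 0.0


def f_area_ends(y_prev, y_cur, X, i):
    return 1.0 if X[i].endswith("m²") and y_cur == "E-S" else 0.0


def f_chain_continues(y_prev, y_cur, X, i):
    return 1.0 if y_prev in ("B-S", "I-S") and y_cur in ("I-S", "E-S") else 0.0


FEATURES = [f_building_begins, f_area_ends, f_chain_continues]
WEIGHTS = [2.0, 2.0, 1.0]  # illustrative values, not learned


def score(Y, X):
    """Sum of weighted feature functions over all positions."""
    total = 0.0
    for i in range(len(X)):
        y_prev = Y[i - 1] if i > 0 else None
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(y_prev, Y[i], X, i)
    return total


def conditional_distribution(X):
    """P(Y | X) for every label sequence, normalized by brute force."""
    sequences = list(itertools.product(LABELS, repeat=len(X)))
    unnormalized = {Y: math.exp(score(Y, X)) for Y in sequences}
    Z = sum(unnormalized.values())
    return {Y: value / Z for Y, value in unnormalized.items()}


X = ["salle", "de", "200_m²"]
distribution = conditional_distribution(X)
best = max(distribution, key=distribution.get)
print(best, round(distribution[best], 3))
```

Enumerating all sequences is only feasible for toy inputs; real CRF libraries use dynamic programming for both the normalization and the search for the best sequence.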
Relation Extraction:
Conditional Random Fields (Methodology)
Method
- Our approach is to treat relation extraction as a sequence tagging
problem.
The labels used for relation extraction are:
- B-S: beginning of the relation (building or area).
- E-S: end of the relation (building or area).
- I-S: tokens between a building and an area (continuity).
- 0: tokens outside any relation.
Relation Extraction:
Conditional Random Fields (Methodology)
Example 1:
salle d'une surface de 200_m² (a room with an area of 200 m²) ==>
- B-S: salle
- I-S: d'une surface de
- E-S: 200_m²
Example 2:
Construction de 2 logements, vrd et jardins , la surface du projet est de
(1000m²) (construction of 2 dwellings, roads and utilities, and gardens;
the project area is 1000 m²) ==> '0'
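Once a sentence is tagged this way, the (building, area) pairs can be read back from the tag sequence. The decoder below is a minimal sketch, and the function name `extract_relations` is an assumption rather than the author's code.

```python
def extract_relations(tokens, tags):
    """Pair each B-S token with the E-S token that closes its chain."""
    relations = []
    start = None
    for token, tag in zip(tokens, tags):
        if tag == "B-S":
            start = token
        elif tag == "E-S" and start is not None:
            relations.append((start, token))
            start = None
        elif tag == "0":
            start = None  # an outside token breaks the chain
    return relations


# Example 1, one tag per token.
tokens = ["salle", "d'une", "surface", "de", "200_m²"]
tags = ["B-S", "I-S", "I-S", "I-S", "E-S"]
relations = extract_relations(tokens, tags)
print(relations)
```

A sentence whose tokens are all tagged '0', as in Example 2, yields an empty list.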
Relation Extraction:
Conditional Random Fields (features)
For each token, this list of features is constructed:
• word.lower: the word in lowercase
• word.isupper: true if the word is in uppercase
• word.istitle: true if the first letter is in uppercase
• word.isdigit: true if the word is made of digits
• postag: the part of speech
• postag[:2]: the second part of the pos
• cpt: the type (surface, building or other)
• lemma: the lemma of the word
• context: for each token we add the features of the two previous and the two next tokens
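A feature extractor along these lines could look like the sketch below. It assumes each token is a dict with `word`, `postag`, `cpt`, and `lemma` keys; the function names are illustrative, and `postag[:2]` is left out because its exact definition is not clear from the slide.

```python
def token_features(token, prefix=""):
    """Features of a single token, optionally prefixed with its offset."""
    word = token["word"]
    return {
        prefix + "word.lower()": word.lower(),
        prefix + "word.isupper()": word.isupper(),
        prefix + "word.istitle()": word.istitle(),
        prefix + "word.isdigit()": word.isdigit(),
        prefix + "postag": token["postag"],
        prefix + "cpt": token["cpt"],
        prefix + "lemma": token["lemma"],
    }


def word2features(sentence, i, window=2):
    """Features of token i plus those of its neighbors within the window."""
    features = {"bias": 1.0}
    features.update(token_features(sentence[i]))
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(sentence):
            features.update(token_features(sentence[j], prefix=f"{offset:+d}:"))
    return features


# Hypothetical pre-annotated tokens (POS tag, type, lemma already computed).
sentence = [
    {"word": "bande", "postag": "NOM", "cpt": "", "lemma": "bande"},
    {"word": "de", "postag": "PRP", "cpt": "", "lemma": "de"},
    {"word": "687_m²", "postag": "NUM", "cpt": "surface", "lemma": "@CARD"},
    {"word": "utiles", "postag": "ADJ", "cpt": "", "lemma": "utile"},
    {"word": "(", "postag": "PUN", "cpt": "", "lemma": "("},
]
features = word2features(sentence, 2)
print(sorted(features))
```

Neighbor features carry a signed offset prefix such as `-2:` or `+1:`, so the model can distinguish what precedes a token from what follows it.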
Relation Extraction:
Conditional Random Fields (features)
{'+1:cpt': '',
'+1:lemma': 'utile',
'+1:postag': 'ADJ',
'+1:postag[:2]': '0',
'+1:word.istitle()': False,
'+1:word.isupper()': False,
'+1:word.lower()': 'utiles',
'+2:cpt': '',
'+2:lemma': '(',
'+2:postag': 'PUN',
'+2:postag[:2]': '0',
'+2:word.istitle()': False,
'+2:word.isupper()': False,
'+2:word.lower()': '(',
'-1:cpt': '',
'-1:lemma': 'de',
'-1:postag': 'PRP',
'-1:postag[:2]': '0',
'-1:word.istitle()': False,
'-1:word.isupper()': False,
'-1:word.lower()': 'de',
'-2:cpt': '',
'-2:lemma': 'bande',
'-2:postag': 'NOM',
'-2:postag[:2]': '0',
'-2:word.istitle()': False,
'-2:word.isupper()': False,
'-2:word.lower()': 'bande',
'bias': 1.0,
'cpt': 'surface',
'lemma': '@CARD',
'postag': 'NUM',
'postag[:2]': '0',
'type': ['utiles'],
'word.isdigit()': False,
'word.istitle()': False,
'word.isupper()': False,
'word.lower()': '687_m²'}
Example (the features above are those of the token 687_m²):
- une zone " logements " composée de logements sous forme de maisons en bande de 687_m² utiles ( + 10_m² de
locaux_poubelles ) .
Relation Extraction:
Conditional Random Fields (learning)
We used grid search to choose the best parameters and the most efficient
learning algorithm.
To generate the training sample, we used regular expressions (PLCs) to
automatically tag sentences, which are then corrected by experts.
Example:
(?P<phrase>((?:[^{]|^)%infra[^%]*%)\s*([^%]\w*\s*){0,13}
(\(|comprend\w*|evalue\w*|de|d'\w+|represent\w*|
environ\w*|\s*)\s*([^%]\w*\s*){0,3}(%surf[^%]*%))
The best estimator was trained with the averaged perceptron algorithm
for 180 epochs.
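The idea of pre-tagging sentences with regular expressions can be illustrated with a much simplified pattern. This is not the authors' expression: the three-word building lexicon, the five-token gap, and the function name `pretag` are all hypothetical.

```python
import re

# Hypothetical, simplified pattern: a building word from a tiny lexicon,
# up to five intermediate tokens, then an area written like "200_m²".
PATTERN = re.compile(
    r"\b(?P<building>salle|logements|gymnase)\b"
    r"(?:\s+\S+){0,5}?"
    r"\s+(?P<area>\d+_m²)"
)


def pretag(sentence):
    """Return (tokens, tags): B-S / I-S / E-S on matched spans, '0' elsewhere."""
    spans = [(m.group(), m.start()) for m in re.finditer(r"\S+", sentence)]
    tags = ["0"] * len(spans)
    for match in PATTERN.finditer(sentence):
        begin, end = match.start("building"), match.start("area")
        for index, (_, start) in enumerate(spans):
            if start == begin:
                tags[index] = "B-S"
            elif start == end:
                tags[index] = "E-S"
            elif begin < start < end:
                tags[index] = "I-S"
    return [token for token, _ in spans], tags


tokens, tags = pretag("salle d'une surface de 200_m²")
print(list(zip(tokens, tags)))
```

Tags produced this way are only a first draft; in the workflow described above, experts correct them before training.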
Relation Extraction:
Conditional Random Fields (results)
We have 1100 sentences (observations), 20% for
test and 80% for training.
The results for the best model are in the table below:
           precision  recall  f1-score  support
0          0.968      0.958   0.963     4489
E-S        0.921      0.874   0.897     214
I-S        0.873      0.920   0.896     1271
B-S        0.911      0.864   0.887     214
avg/total  0.945      0.944   0.944     6188
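The avg/total row is consistent with a support-weighted average of the per-label rows, which can be checked directly from the table. The variable names below are illustrative.

```python
# (precision, recall, f1-score, support) per label, copied from the table.
rows = {
    "0": (0.968, 0.958, 0.963, 4489),
    "E-S": (0.921, 0.874, 0.897, 214),
    "I-S": (0.873, 0.920, 0.896, 1271),
    "B-S": (0.911, 0.864, 0.887, 214),
}

total_support = sum(row[3] for row in rows.values())


def weighted_average(column):
    """Average of one metric column, weighted by the support of each label."""
    return sum(row[column] * row[3] for row in rows.values()) / total_support


averages = tuple(round(weighted_average(column), 3) for column in range(3))
print(total_support, averages)
```

Because the '0' label accounts for most tokens, the averages stay close to its scores even though the B-S and E-S labels are noticeably harder.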
Relation Extraction:
Conditional Random Fields (results)
Although the training set contains only sentences labeled with a single
(area-building) link, even where several exist, the model managed to
extract several links from the same sentence.
Example:
- une zone " logements " composée de logements sous forme de maisons en bande de 687_m²
utiles ( + 10_m² de locaux_poubelles ) .
Results:
['logements', '687_m²'],
['10_m²', 'locaux_poubelles']
Scikit-learn for text mining at Jurismarchés

  • 1. Scikit-learn for text mining at Jurismarchés. Oussama Ahmia Jusimarchés – IRISA
  • 2. Outlines: 2 1. Introduction. 2. CityZenMap. 3. Text categorization example. 4. Relation extraction.
  • 3. Context 3 Jurismarchés is an SME founded in 2005 based in Nantes, France. - 14 employees, 1/2 IT team, 1/3 market experts and analysts. Jurismarchés has two missions: (1) helping the development of companies in the area of public procurement by focusing on the provision of economic environment knowledge and business (2) democratizing the access to information on markets, A database of more than 7,000,000 documents about markets. About 52,000 documents a year require manual content curation by a human expert, and among them 13,000 a year lead to inherent difficulties..
  • 4. CityZenMap: 4 A map that allows us to visualize and follow territory planning (new buildings and constructions) in france. CityZenMap E-citizenship CityZenMap is an E-citizenship application that reinforce the relationship between the citizens and the elected representative. autonomously detects territory Development projects, from public databases (Call for tenders, commercial authorization), that contains announcements of several areas NLP ML
  • 5. CityZenMap: 5 The map - Classifying the announces. - Categorizing, the projects by type. - Transform and summarize the title of the projects. - Geo locate the projects. To retrieve the relevant projects the app analyses the public procurement announces. The process of finding and publishing the projects on the map is done following three steps:
  • 6. CityZenMap: 6 Some text mining concepts The Goal is to build from libeled data, a function (estimator) F : X → Y Where : X is the observation features. Y is the predicted values Supervised learning: What features for Text documents?? We need to extract the features One of the easiest feature extraction method Bag of words : [[0 1 0 1 0] [1 0 1 0 1] ['are', 'pydata', 'rocking', 'rocks', 'you']
  • 7. CityZenMap: 7 Some text mining concepts Stem is the form to which affixes can be attached (root of a word) A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish" The stemming: An example of stemming using Sklearn & NLTK: [[0 1 1 0] [1 0 1 1]] ['are', 'pydata', 'rock', 'you']
  • 8. CityZenMap: 8 Random forest Randomization • Bootstrap samples • Random selection variables
  • 11. CityZenMap: The results. We have 2200 projects (observations) labeled by experts, 12% positive and 88% negative (stem-based RMF).
                     precision  recall  f1-score  support
      Non pertinent      0.941   0.969     0.955     88 %
      Pertinent          0.676   0.511     0.582     12 %
      avg / total        0.911   0.918     0.913    100 %
    The lower precision on the positive class is due to:
    - a lack of examples
    - some errors in the experts' labeling
    The stem-based classifier is more efficient with long project titles:
    - Restauration de l'Ozon : Maitrise d'oeuvre pour l'aménagement morphologique et écologique des tronçons prioritaires Etude hydraulique et proposition de scénarios de protection des zones.
    Non-stem-based:
    - Construction de la Cité scolaire de Luzech
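As a quick consistency check on the table, the f1-score column is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported values; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_pertinent = f1_score(0.676, 0.511)
f1_non_pertinent = f1_score(0.941, 0.969)

print(round(f1_pertinent, 3))      # matches the 'Pertinent' row
print(round(f1_non_pertinent, 3))  # matches the 'Non pertinent' row
```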
  • 12. Text categorization example: Context. Public procurement announcements contain legal and administrative information (noise). The task is to detect whether a sentence in an announcement is legal and administrative content ("juridico-admin") or a passage that gives information about the project. This is a binary classification problem. The utility:
    - Improve the results of the search engine.
    - Provide a useful tool for the experts at Jurismarchés.
  • 13. CityZenMap: Scikit-learn pipelines. Using Pipeline to implement a workflow, saving the pipeline, and loading the pipeline.
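The three steps named on this slide (build, save, load) could look like the sketch below. The vectorizer, the classifier, the toy sentences, the label names and the file name are all assumptions made for illustration; the slide's own code is not available.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: "admin" = legal/administrative noise, "project" = project info.
sentences = [
    "offers must be submitted before the deadline",
    "the contract is governed by the public procurement code",
    "appeals can be lodged with the administrative court",
    "construction of a school with a gymnasium",
    "renovation of the town hall roof",
    "building of twelve housing units and a car park",
]
labels = ["admin", "admin", "admin", "project", "project", "project"]

# Workflow: feature extraction followed by a classifier, fitted as one object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(sentences, labels)

# Saving the pipeline, then loading it back.
path = os.path.join(tempfile.mkdtemp(), "pipeline.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

same = list(restored.predict(sentences)) == list(pipeline.predict(sentences))
print(same)
```

Persisting the whole pipeline keeps the fitted vocabulary and the classifier together, so raw sentences can be passed directly to the reloaded object.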
  • 14. Text categorization example: The results. We have 1M sentences (observations), 20% for test and 80% for training. 20% are negative and 80% are positive.
                   precision  recall  f1-score  support
      non               0.99    0.99      0.99    29659
      oui               1.00    1.00      1.00   153906
      avg / total       1.00    1.00      1.00   183565
    In our case, reducing false positives matters most, so we ignore positive predictions whose classification probability is below a threshold.
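The thresholding rule can be sketched in a few lines. The threshold value, the 0.5 cutoff between classes and the label names below are assumptions; the slide does not give them.

```python
def apply_threshold(positive_probability, threshold=0.9, cutoff=0.5):
    """Keep a positive prediction only when the model is confident enough.

    threshold and cutoff are hypothetical values chosen for illustration.
    """
    if positive_probability >= threshold:
        return "positive"
    if positive_probability >= cutoff:
        return None  # positive but below the threshold: ignored
    return "negative"


probabilities = [0.97, 0.62, 0.08]  # e.g. the positive-class column of predict_proba
decisions = [apply_threshold(p) for p in probabilities]
print(decisions)
```

Raising the threshold trades recall for precision on the positive class, which matches the stated priority of reducing false positives.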
  • 15. Relation Extraction: Introduction. Context: modeling French cities from the textual descriptions of territory development projects. How?
    - Detect the area of each building from the text.
    - In other words, link each building with its respective area.
    Relation extraction aims to detect (establish) relationships between named entities.
  • 16. Relation Extraction: Conditional Random Fields. Looks like an HMM, smells like linear regression.
  • 17. Relation Extraction: CRFs (theory). A CRF model consists of:
    - F = <f1,...,fk>, a vector of feature functions
    - λ = <λ1,...,λk>, a vector of weights
    Let X = <x1,...,xt> be an observed sentence and Y = <y1,...,yt> be the labels. The model defines a conditional distribution P(Y | X) with a normalization term Z(X).
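Assuming the standard linear-chain form, the conditional distribution and its normalization can be written with the symbols defined on this slide (i indexes the t tokens, j indexes the k feature functions, and Y' ranges over all possible label sequences):

```latex
P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{t} \sum_{j=1}^{k} \lambda_j \, f_j(y_{i-1}, y_i, X, i) \Big)

Z(X) = \sum_{Y'} \exp\Big( \sum_{i=1}^{t} \sum_{j=1}^{k} \lambda_j \, f_j(y'_{i-1}, y'_i, X, i) \Big)
```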
  • 18. Relation Extraction: CRFs (example). CRF example and feature functions.
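To make the role of feature functions concrete, here is a toy brute-force sketch that scores every label sequence for a three-token sentence and normalizes the result. The two feature functions, their weights and the sentence are hypothetical; a real CRF library uses dynamic programming instead of enumeration.

```python
import itertools
import math

LABELS = ["B-S", "I-S", "E-S", "0"]


def f_area_ends(prev_label, label, sentence, i):
    """Fires when a token ending in 'm²' is tagged as the end of a relation."""
    return 1.0 if sentence[i].endswith("m²") and label == "E-S" else 0.0


def f_begin_then_inside(prev_label, label, sentence, i):
    """Fires on the label transition B-S -> I-S."""
    return 1.0 if prev_label == "B-S" and label == "I-S" else 0.0


FEATURES = [f_area_ends, f_begin_then_inside]
WEIGHTS = [2.0, 1.0]  # hypothetical weights, normally learned from data


def score(labels, sentence):
    """Weighted sum of all feature functions over all positions."""
    total = 0.0
    for i, label in enumerate(labels):
        prev_label = labels[i - 1] if i > 0 else "START"
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(prev_label, label, sentence, i)
    return total


def probability(labels, sentence):
    """exp(score) divided by the sum of exp(score) over every label sequence."""
    candidates = itertools.product(LABELS, repeat=len(sentence))
    z = sum(math.exp(score(c, sentence)) for c in candidates)
    return math.exp(score(labels, sentence)) / z


sentence = ["salle", "de", "200_m²"]
all_sequences = list(itertools.product(LABELS, repeat=len(sentence)))
best = max(all_sequences, key=lambda seq: probability(seq, sentence))
total = sum(probability(seq, sentence) for seq in all_sequences)
print(best, round(probability(best, sentence), 3))
```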
  • 19. Relation Extraction: Conditional Random Fields (methodology). Our approach is to treat relation extraction as a sequence tagging problem. The labels used for relationship extraction are:
    - B-S: beginning of the relation (building or area).
    - E-S: end of the relation (building or area).
    - I-S: tokens between a building and an area (continuity).
    - 0: out-of-field tokens.
  • 20. Relation Extraction: Conditional Random Fields (methodology).
    Example 1: salle d'une surface de 200_m² ==>
    - B-S: salle
    - I-S: d'une surface de
    - E-S: 200_m²
    Example 2: Construction de 2 logements, vrd et jardins , la surface du projet est de (1000m²) ==> every token is tagged '0'
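Once a sentence is tagged, the B-S / I-S / E-S labels can be turned back into pairs. The decoder below is a minimal sketch of that step; the tokenization of the two examples and the function name are assumptions.

```python
def extract_relations(tokens, tags):
    """Pair each B-S token with the next E-S token reached through I-S tokens."""
    relations = []
    start = None
    for token, tag in zip(tokens, tags):
        if tag == "B-S":
            start = token                # a relation opens here
        elif tag == "E-S" and start is not None:
            relations.append([start, token])  # the relation closes here
            start = None
        elif tag == "0":
            start = None                 # an out-of-field token breaks the chain
    return relations


# Example 1 from the slide, with an assumed tokenization.
tokens_1 = ["salle", "d'une", "surface", "de", "200_m²"]
tags_1 = ["B-S", "I-S", "I-S", "I-S", "E-S"]

# Example 2: every token is tagged 0, so no relation is extracted.
tokens_2 = ["Construction", "de", "2", "logements", ",", "vrd", "et", "jardins"]
tags_2 = ["0"] * len(tokens_2)

print(extract_relations(tokens_1, tags_1))
print(extract_relations(tokens_2, tags_2))
```

Because the decoder resets after each E-S, it naturally returns several pairs when a sentence contains more than one tagged relation.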
  • 21. Relation Extraction: Conditional Random Fields (features). For each token, this list of features is constructed:
    • word.lower: the word in lowercase
    • word.isupper: true if the word is in uppercase
    • word.istitle: true if the first letter is in uppercase
    • word.isdigit: true if the word consists of digits
    • postag: the part of speech
    • postag[:2]: the second part of the POS tag
    • cpt: the type (surface, building or other)
    • lemma: the lemma of the word
    • context: for each token, we add the features of the two previous and the two next tokens
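The feature list above can be sketched as a function that builds one dictionary per token. This is an illustration under stated assumptions, not the authors' code: the token representation (a dict with `word`, `postag`, `lemma`, `cpt`) is hypothetical, and the reading of `postag[:2]` as the part of the tag after a colon (or '0' when there is none) is a guess.

```python
def token_features(token):
    """Features of a single token (token is a hypothetical dict)."""
    word = token["word"]
    pos_parts = token["postag"].split(":")  # assumed 'MAIN:sub' tag format
    return {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": pos_parts[0],
        "postag[:2]": pos_parts[1] if len(pos_parts) > 1 else "0",
        "cpt": token["cpt"],
        "lemma": token["lemma"],
    }


def sentence_features(sentence, i, window=2):
    """Features of token i plus those of its neighbours within the window."""
    features = {"bias": 1.0}
    features.update(token_features(sentence[i]))
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(sentence):
            continue
        for name, value in token_features(sentence[j]).items():
            features[f"{offset:+d}:{name}"] = value  # e.g. '-1:lemma', '+2:postag'
    return features


# Hypothetical annotated tokens around '687_m²'.
sentence = [
    {"word": "bande", "postag": "NOM", "lemma": "bande", "cpt": ""},
    {"word": "de", "postag": "PRP", "lemma": "de", "cpt": ""},
    {"word": "687_m²", "postag": "NUM", "lemma": "@CARD", "cpt": "surface"},
    {"word": "utiles", "postag": "ADJ", "lemma": "utile", "cpt": ""},
    {"word": "(", "postag": "PUN", "lemma": "(", "cpt": ""},
]
features = sentence_features(sentence, 2)
print(features["cpt"], features["-1:lemma"], features["+1:postag"])
```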
  • 22. Relation Extraction: Conditional Random Fields (features). Example: une zone " logements " composée de logements sous forme de maisons en bande de 687_m² utiles ( + 10_m² de locaux_poubelles ) .
    Features built for the token 687_m²:
    {'+1:cpt': '', '+1:lemma': 'utile', '+1:postag': 'ADJ', '+1:postag[:2]': '0', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:word.lower()': 'utiles',
     '+2:cpt': '', '+2:lemma': '(', '+2:postag': 'PUN', '+2:postag[:2]': '0', '+2:word.istitle()': False, '+2:word.isupper()': False, '+2:word.lower()': '(',
     '-1:cpt': '', '-1:lemma': 'de', '-1:postag': 'PRP', '-1:postag[:2]': '0', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:word.lower()': 'de',
     '-2:cpt': '', '-2:postag': 'NOM', '-2:postag[:2]': '0', '-2:word.istitle()': False, '-2:word.isupper()': False, '-2:word.lower()': 'bande', '-2lemma': 'bande',
     'bias': 1.0, 'cpt': 'surface', 'lemma': '@CARD', 'postag': 'NUM', 'postag[:2]': '0', 'type': ['utiles'],
     'word.isdigit()': False, 'word.istitle()': False, 'word.isupper()': False, 'word.lower()': '687_m²'}
  • 23. Relation Extraction: Conditional Random Fields (learning). We used grid search to choose the best parameters and the most efficient learning algorithm. To generate the training sample, we used regular expressions (PLCs) to automatically tag sentences, which were then corrected by experts. Example:
    (?P<phrase>((?:[^{]|^)%infra[^%]*%)\s*([^%]\w*\s*){0,13}(\(|comprend\w*|evalue\w*|de|d'\w+|represent\w*|environ\w*|\s*)\s*([^%]\w*\s*){0,3}(%surf[^%]*%))
    The best estimator was trained with the averaged perceptron algorithm for 180 epochs.
  • 24. Relation Extraction: Conditional Random Fields (results). We have 1100 sentences (observations), 20% for test and 80% for training. The results for the best model are in the table below:
                   precision  recall  f1-score  support
      0                0.968   0.958     0.963     4489
      E-S              0.921   0.874     0.897      214
      I-S              0.873   0.920     0.896     1271
      B-S              0.911   0.864     0.887      214
      avg / total      0.945   0.944     0.944     6188
  • 25. Relation Extraction: Conditional Random Fields (results). Although the training set contains only sentences labeled with a single link (area-building), even when several exist, the model managed to extract several links from the same sentence. Example: une zone " logements " composée de logements sous forme de maisons en bande de 687_m² utiles ( + 10_m² de locaux_poubelles ) .
    Results: ['logements', '687_m²'], ['10_m²', 'locaux_poubelles']