This document summarizes the use of scikit-learn for text mining at Jurismarchés. It describes CityZenMap, which uses NLP and ML to automatically detect and categorize territory development projects from public databases. It also provides an example of using text categorization to classify sentences in public procurement announcements as legal/administrative or project information. Finally, it discusses using conditional random fields for relation extraction to link buildings and areas mentioned in texts describing French city projects.
3. Context

Jurismarchés is an SME founded in 2005, based in Nantes, France.
- 14 employees: half in the IT team, a third market experts and analysts.

Jurismarchés has two missions:
(1) helping the development of companies in the area of public procurement, by focusing on the provision of knowledge about the economic environment and business;
(2) democratizing access to information on markets.

A database of more than 7,000,000 documents about markets. About 52,000 documents a year require manual content curation by a human expert, and among them 13,000 a year present inherent difficulties.
4. CityZenMap

A map that allows us to visualize and follow territory planning (new buildings and constructions) in France.

CityZenMap is an e-citizenship application that reinforces the relationship between citizens and their elected representatives.

Using NLP and ML, it autonomously detects territory development projects from public databases (calls for tenders, commercial authorizations) that contain announcements covering several areas.
5. CityZenMap: The map

To retrieve the relevant projects, the app analyses public procurement announcements. The process of finding and publishing the projects on the map follows these steps:
- Classifying the announcements.
- Categorizing the projects by type.
- Transforming and summarizing the project titles.
- Geolocating the projects.
6. CityZenMap: Some text mining concepts

Supervised learning: the goal is to build, from labeled data, a function (estimator) F : X → Y, where:
- X is the observation features.
- Y is the predicted values.

What features for text documents? We need to extract them. One of the easiest feature extraction methods is the bag of words:

[[0 1 0 1 0]
 [1 0 1 0 1]]
['are', 'pydata', 'rocking', 'rocks', 'you']
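A minimal sketch of how this output can be produced with scikit-learn's CountVectorizer; the two input sentences ("pydata rocks", "you are rocking") are inferred from the matrix above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["pydata rocks", "you are rocking"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(X.toarray())
# [[0 1 0 1 0]
#  [1 0 1 0 1]]
print(vectorizer.get_feature_names_out())
# ['are' 'pydata' 'rocking' 'rocks' 'you']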
7. CityZenMap: Some text mining concepts

Stemming: a stem is the form to which affixes can be attached (the root of a word). A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish".

An example of stemming using scikit-learn & NLTK:

[[0 1 1 0]
 [1 0 1 1]]
['are', 'pydata', 'rock', 'you']
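A minimal sketch of this combination, assuming NLTK's SnowballStemmer plugged into CountVectorizer's analyzer:

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize with the default analyzer, then reduce each token to its stem.
    return [stemmer.stem(token) for token in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
X = vectorizer.fit_transform(["pydata rocks", "you are rocking"])

print(X.toarray())
# [[0 1 1 0]
#  [1 0 1 1]]
print(vectorizer.get_feature_names_out())
# ['are' 'pydata' 'rock' 'you']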
11. CityZenMap: The results

We have 2,200 projects (observations) labeled by experts, 12% positive and 88% negative (stem-based RMF).

              precision  recall  f1-score  support
Non pertinent 0.941      0.969   0.955     88%
Pertinent     0.676      0.511   0.582     12%
avg / total   0.911      0.918   0.913     100%

The lower precision on the positive class is due to:
- a lack of examples
- some errors in the expert labeling

The stem-based classifier is more effective on long project titles, e.g.:
- Restauration de l'Ozon : Maitrise d'oeuvre pour l'aménagement morphologique et écologique des tronçons prioritaires. Etude hydraulique et proposition de scénarios de protection des zones.
Non-stemmed example (short title):
- Construction de la Cité scolaire de Luzech
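A report in this format is what scikit-learn's classification_report prints; the sketch below only shows how such a report is produced, with made-up toy labels standing in for the real evaluation data:

from sklearn.metrics import classification_report

# Hypothetical toy labels standing in for the expert annotations (y_true)
# and the classifier's predictions (y_pred); 1 = "Pertinent".
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(classification_report(y_true, y_pred,
                            target_names=["Non pertinent", "Pertinent"],
                            digits=3))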
12. Text categorization example

Context: public procurement announcements contain legal and administrative ("juridico-admin") information, which is noise. The task is a binary classification: detect whether a sentence in an announcement is legal/administrative content or a passage that gives information about the project.

The utility:
- Improve the results of the search engine.
- A useful tool for the experts at Jurismarchés.
14. Text categorization example: The results

We have 1M sentences (observations): 20% for test and 80% for training. 20% are negative and 80% are positive.

            precision  recall  f1-score  support
non         0.99       0.99    0.99      29659
oui         1.00       1.00    1.00      153906
avg / total 1.00       1.00    1.00      183565

In our case, reducing false positives matters most, so we ignore positive observations whose classification probability is smaller than a threshold, as sketched below.
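A minimal sketch of that thresholding step, assuming a scikit-learn pipeline with predict_proba; the toy sentences, labels, and the 0.9 threshold are hypothetical:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = 'oui' (legal/administrative), 0 = 'non'.
sentences = [
    "article 12 du code des marches publics",
    "construction d'une salle polyvalente",
]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

proba = clf.predict_proba(sentences)[:, 1]  # P('oui') for each sentence
THRESHOLD = 0.9                             # assumed value, tuned on validation data
predictions = np.where(proba >= THRESHOLD, 1, 0)
print(predictions)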
15. Relation Extraction: Introduction

Context: modeling French cities from the textual descriptions of territory development projects.

How? Detect the area of each building from the text; in other words, link each building with its respective area.

Relation extraction aims to detect (establish) relationships between named entities.
17. Relation Extraction: CRFs (theory)

A CRF model consists of:
- F = <f_1, ..., f_k>, a vector of feature functions
- λ = <λ_1, ..., λ_k>, a vector of weights

Let X = <x_1, ..., x_t> be an observed sentence and Y = <y_1, ..., y_t> the corresponding labels.

Conditional distribution:
p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{k} \lambda_i f_i(y_{j-1}, y_j, X, j) \right)

Normalization:
Z(X) = \sum_{Y'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{k} \lambda_i f_i(y'_{j-1}, y'_j, X, j) \right)
19. Relation Extraction: Conditional Random Fields (methodology)

Method: our approach is to treat relation extraction as a sequence tagging problem.

The labels used for the relation extraction are:
- B-S: beginning of the relation (building or area).
- E-S: ending of the relation (building or area).
- I-S: tokens between a building and an area (continuity).
- 0: out-of-field tokens (outside any relation).
20. Relation Extraction: Conditional Random Fields (methodology)

Example 1:
salle d'une surface de 200_m² ==>
- B-S: salle
- I-S: d'une surface de
- E-S: 200_m²

Example 2:
Construction de 2 logements, vrd et jardins, la surface du projet est de (1000m²) ==> all tokens tagged '0'
21. Relation Extraction: Conditional Random Fields (features)

For each token, this list of features is constructed (see the sketch after the list):
- word.lower: the word in lowercase
- word.isupper: true if the word is in uppercase
- word.istitle: true if the first letter is in uppercase
- word.isdigit: true if the word is a digit
- postag: the part-of-speech tag
- postag[:2]: the first two characters of the POS tag (coarse category)
- cpt: the type (surface, building or other)
- lemma: the lemma of the word
- context: for each token we add the features of the two previous and two next tokens
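A sketch of a per-token feature dictionary in the style used by sklearn-crfsuite; here each token is assumed to arrive as a (word, postag, cpt, lemma) tuple, with cpt and lemma precomputed upstream:

def token_features(sent, i):
    # sent is a list of (word, postag, cpt, lemma) tuples.
    word, postag, cpt, lemma = sent[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],
        "cpt": cpt,
        "lemma": lemma,
    }
    # Context: repeat a subset of the features for the two previous
    # and two next tokens, keyed by their relative offset.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            w, p, c, l = sent[j]
            features[f"{offset}:word.lower"] = w.lower()
            features[f"{offset}:postag"] = p
            features[f"{offset}:cpt"] = c
            features[f"{offset}:lemma"] = l
    return features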
23. Relation Extraction: Conditional Random Fields (learning)

We used grid search to choose the best parameters and the most efficient learning algorithm.

To generate the learning sample, we used regular expressions (PLCs) to automatically tag sentences, which were then corrected by experts. Example:

(?P<phrase>((?:[^{]|^)%infra[^%]*%)\s*([^%]\w*\s*){0,13}((|comprend\w*|evalue\w*|de|d'\w+|represent\w*|environ\w*|\s*)\s*([^%]\w*\s*){0,3}(%surf[^%]*%))

The best estimator was trained with the averaged perceptron algorithm over 180 epochs, as sketched below.
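A sketch of that final training step with sklearn-crfsuite, whose 'ap' algorithm is CRFsuite's averaged perceptron; the toy sentence reuses token_features from the previous sketch with the tags of Example 1, and the POS/cpt annotations are made up for illustration:

import sklearn_crfsuite

sent = [("salle", "NC", "building", "salle"),
        ("d'une", "P", "other", "de"),
        ("surface", "NC", "other", "surface"),
        ("de", "P", "other", "de"),
        ("200_m²", "NC", "surface", "200_m²")]
X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [["B-S", "I-S", "I-S", "I-S", "E-S"]]

# Averaged perceptron over 180 epochs, the configuration retained
# by the grid search described above.
crf = sklearn_crfsuite.CRF(algorithm="ap", max_iterations=180)
crf.fit(X_train, y_train)
print(crf.predict(X_train))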
24. Relation Extraction: Conditional Random Fields (results)

We have 1,100 sentences (observations), 20% for test and 80% for training. The results for the best model are in the table below:
           precision  recall  f1-score  support
0          0.968      0.958   0.963     4489
E-S        0.921      0.874   0.897     214
I-S        0.873      0.920   0.896     1271
B-S        0.911      0.864   0.887     214
avg/total  0.945      0.944   0.944     6188
25. Relation Extraction: Conditional Random Fields (results)

The training set contains only sentences labeled with a single (area, building) link, even when several are present; nevertheless, the model managed to extract several relations in the same sentence.

Example:
— une zone " logements " composée de logements sous forme de maisons en bande de 687_m² utiles ( + 10_m² de locaux_poubelles ) .

Extracted relations:
['logements', '687_m²'], ['10_m²', 'locaux_poubelles']
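A hypothetical decoding helper showing how such pairs can be read off the B-S/I-S/E-S/0 tags (an illustration, not the exact CityZenMap code):

def extract_pairs(tokens, tags):
    # Collect [B-S token, E-S token] pairs from a tagged sentence.
    pairs, start = [], None
    for token, tag in zip(tokens, tags):
        if tag == "B-S":
            start = token
        elif tag == "E-S" and start is not None:
            pairs.append([start, token])
            start = None
    return pairs

tokens = ["salle", "d'une", "surface", "de", "200_m²"]
tags = ["B-S", "I-S", "I-S", "I-S", "E-S"]
print(extract_pairs(tokens, tags))  # [['salle', '200_m²']]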