This document summarizes the use of scikit-learn for text mining at Jurismarchés. It describes CityZenMap, which uses NLP and ML to automatically detect and categorize territory development projects from public databases. It also provides an example of using text categorization to classify sentences in public procurement announcements as legal/administrative or project information. Finally, it discusses using conditional random fields for relation extraction to link buildings and areas mentioned in texts describing French city projects.
3. Context

Jurismarchés is an SME founded in 2005, based in Nantes, France.
- 14 employees: half in the IT team, a third market experts and analysts.

Jurismarchés has two missions:
(1) helping the development of companies in the area of public procurement, by focusing on the provision of knowledge about the economic environment and business;
(2) democratizing access to information on markets.

A database of more than 7,000,000 documents about markets. About 52,000 documents a year require manual content curation by a human expert, and among them 13,000 a year present inherent difficulties.
4. CityZenMap

A map that allows us to visualize and follow territory planning (new buildings and constructions) in France.

CityZenMap is an e-citizenship application that reinforces the relationship between citizens and their elected representatives.

Using NLP and ML, it autonomously detects territory development projects from public databases (calls for tenders, commercial authorizations) that contain announcements covering several areas.
5. CityZenMap: The map

To retrieve the relevant projects, the app analyses public procurement announcements. The process of finding and publishing the projects on the map follows these steps:
- Classifying the announcements.
- Categorizing the projects by type.
- Transforming and summarizing the project titles.
- Geolocating the projects.
6. CityZenMap: Some text mining concepts

Supervised learning: the goal is to build, from labeled data, a function (estimator) F : X → Y, where:
- X is the observation features.
- Y is the predicted values.

What features for text documents? We need to extract them. One of the easiest feature extraction methods is the bag of words:

[[0 1 0 1 0]
 [1 0 1 0 1]]
['are', 'pydata', 'rocking', 'rocks', 'you']
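A minimal sketch of how this output can be produced with scikit-learn's CountVectorizer; the two input sentences ("pydata rocks", "you are rocking") are inferred from the matrix above:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["pydata rocks", "you are rocking"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(X.toarray())
# [[0 1 0 1 0]
#  [1 0 1 0 1]]
print(vectorizer.get_feature_names_out())
# ['are' 'pydata' 'rocking' 'rocks' 'you']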
7. CityZenMap: Some text mining concepts

Stemming: a stem is the form to which affixes can be attached (the root of a word). A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish".

An example of stemming using scikit-learn & NLTK:

[[0 1 1 0]
 [1 0 1 1]]
['are', 'pydata', 'rock', 'you']
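A minimal sketch of this combination, assuming NLTK's SnowballStemmer plugged into CountVectorizer's analyzer:

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize with the default analyzer, then reduce each token to its stem.
    return [stemmer.stem(token) for token in base_analyzer(doc)]

vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
X = vectorizer.fit_transform(["pydata rocks", "you are rocking"])

print(X.toarray())
# [[0 1 1 0]
#  [1 0 1 1]]
print(vectorizer.get_feature_names_out())
# ['are' 'pydata' 'rock' 'you']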
11. CityZenMap: The results

We have 2,200 projects (observations) labeled by experts, 12% positive and 88% negative (stem-based RMF).

              precision  recall  f1-score  support
Non pertinent 0.941      0.969   0.955     88%
Pertinent     0.676      0.511   0.582     12%
avg / total   0.911      0.918   0.913     100%

The lower precision on the positive class is due to:
- a lack of examples
- some errors in the expert labeling

The stem-based classifier is more effective on long project titles, e.g.:
- Restauration de l'Ozon : Maitrise d'oeuvre pour l'aménagement morphologique et écologique des tronçons prioritaires. Etude hydraulique et proposition de scénarios de protection des zones.
Non-stemmed example (short title):
- Construction de la Cité scolaire de Luzech
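A report in this format is what scikit-learn's classification_report prints; the sketch below only shows how such a report is produced, with made-up toy labels standing in for the real evaluation data:

from sklearn.metrics import classification_report

# Hypothetical toy labels standing in for the expert annotations (y_true)
# and the classifier's predictions (y_pred); 1 = "Pertinent".
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

print(classification_report(y_true, y_pred,
                            target_names=["Non pertinent", "Pertinent"],
                            digits=3))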
12. Text categorization example

Context: public procurement announcements contain legal and administrative ("juridico-admin") information, which is noise. The task is a binary classification: detect whether a sentence in an announcement is legal/administrative content or a passage that gives information about the project.

The utility:
- Improve the results of the search engine.
- A useful tool for the experts at Jurismarchés.
14. Text categorization example: The results

We have 1M sentences (observations): 20% for test and 80% for training. 20% are negative and 80% are positive.

            precision  recall  f1-score  support
non         0.99       0.99    0.99      29659
oui         1.00       1.00    1.00      153906
avg / total 1.00       1.00    1.00      183565

In our case, reducing false positives matters most, so we ignore positive observations whose classification probability is smaller than a threshold, as sketched below.
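A minimal sketch of that thresholding step, assuming a scikit-learn pipeline with predict_proba; the toy sentences, labels, and the 0.9 threshold are hypothetical:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = 'oui' (legal/administrative), 0 = 'non'.
sentences = [
    "article 12 du code des marches publics",
    "construction d'une salle polyvalente",
]
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(sentences, labels)

proba = clf.predict_proba(sentences)[:, 1]  # P('oui') for each sentence
THRESHOLD = 0.9                             # assumed value, tuned on validation data
predictions = np.where(proba >= THRESHOLD, 1, 0)
print(predictions)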
15. Relation Extraction: Introduction

Context: modeling French cities from the textual descriptions of territory development projects.

How? Detect the area of each building from the text; in other words, link each building with its respective area.

Relation extraction aims to detect (establish) relationships between named entities.
17. Relation Extraction: CRFs (theory)

A CRF model consists of:
- F = <f_1, ..., f_k>, a vector of feature functions
- λ = <λ_1, ..., λ_k>, a vector of weights

Let X = <x_1, ..., x_t> be an observed sentence and Y = <y_1, ..., y_t> the corresponding labels.

Conditional distribution:
p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{k} \lambda_i f_i(y_{j-1}, y_j, X, j) \right)

Normalization:
Z(X) = \sum_{Y'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{k} \lambda_i f_i(y'_{j-1}, y'_j, X, j) \right)
19. Relation Extraction: Conditional Random Fields (methodology)

Method: our approach is to treat relation extraction as a sequence tagging problem.

The labels used for the relation extraction are:
- B-S: beginning of the relation (building or area).
- E-S: ending of the relation (building or area).
- I-S: tokens between a building and an area (continuity).
- 0: out-of-field tokens (outside any relation).
20. Relation Extraction: Conditional Random Fields (methodology)

Example 1:
salle d'une surface de 200_m² ==>
- B-S: salle
- I-S: d'une surface de
- E-S: 200_m²

Example 2:
Construction de 2 logements, vrd et jardins, la surface du projet est de (1000m²) ==> all tokens tagged '0'
21. Relation Extraction: Conditional Random Fields (features)

For each token, this list of features is constructed (see the sketch after the list):
- word.lower: the word in lowercase
- word.isupper: true if the word is in uppercase
- word.istitle: true if the first letter is in uppercase
- word.isdigit: true if the word is a digit
- postag: the part-of-speech tag
- postag[:2]: the first two characters of the POS tag (coarse category)
- cpt: the type (surface, building or other)
- lemma: the lemma of the word
- context: for each token we add the features of the two previous and two next tokens
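A sketch of a per-token feature dictionary in the style used by sklearn-crfsuite; here each token is assumed to arrive as a (word, postag, cpt, lemma) tuple, with cpt and lemma precomputed upstream:

def token_features(sent, i):
    # sent is a list of (word, postag, cpt, lemma) tuples.
    word, postag, cpt, lemma = sent[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],
        "cpt": cpt,
        "lemma": lemma,
    }
    # Context: repeat a subset of the features for the two previous
    # and two next tokens, keyed by their relative offset.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            w, p, c, l = sent[j]
            features[f"{offset}:word.lower"] = w.lower()
            features[f"{offset}:postag"] = p
            features[f"{offset}:cpt"] = c
            features[f"{offset}:lemma"] = l
    return features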
23. Relation Extraction: Conditional Random Fields (learning)

We used grid search to choose the best parameters and the most efficient learning algorithm.

To generate the learning sample, we used regular expressions (PLCs) to automatically tag sentences, which were then corrected by experts. Example:

(?P<phrase>((?:[^{]|^)%infra[^%]*%)\s*([^%]\w*\s*){0,13}((|comprend\w*|evalue\w*|de|d'\w+|represent\w*|environ\w*|\s*)\s*([^%]\w*\s*){0,3}(%surf[^%]*%))

The best estimator was trained with the averaged perceptron algorithm over 180 epochs, as sketched below.
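A sketch of that final training step with sklearn-crfsuite, whose 'ap' algorithm is CRFsuite's averaged perceptron; the toy sentence reuses token_features from the previous sketch with the tags of Example 1, and the POS/cpt annotations are made up for illustration:

import sklearn_crfsuite

sent = [("salle", "NC", "building", "salle"),
        ("d'une", "P", "other", "de"),
        ("surface", "NC", "other", "surface"),
        ("de", "P", "other", "de"),
        ("200_m²", "NC", "surface", "200_m²")]
X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [["B-S", "I-S", "I-S", "I-S", "E-S"]]

# Averaged perceptron over 180 epochs, the configuration retained
# by the grid search described above.
crf = sklearn_crfsuite.CRF(algorithm="ap", max_iterations=180)
crf.fit(X_train, y_train)
print(crf.predict(X_train))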
24. Relation Extraction: Conditional Random Fields (results)

We have 1,100 sentences (observations), 20% for test and 80% for training. The results for the best model are in the table below:
           precision  recall  f1-score  support
0          0.968      0.958   0.963     4489
E-S        0.921      0.874   0.897     214
I-S        0.873      0.920   0.896     1271
B-S        0.911      0.864   0.887     214
avg/total  0.945      0.944   0.944     6188
25. Relation Extraction: Conditional Random Fields (results)

The training set contains only sentences labeled with a single (area, building) link, even when several are present; nevertheless, the model managed to extract several relations in the same sentence.

Example:
— une zone " logements " composée de logements sous forme de maisons en bande de 687_m² utiles ( + 10_m² de locaux_poubelles ) .

Extracted relations:
['logements', '687_m²'], ['10_m²', 'locaux_poubelles']
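A hypothetical decoding helper showing how such pairs can be read off the B-S/I-S/E-S/0 tags (an illustration, not the exact CityZenMap code):

def extract_pairs(tokens, tags):
    # Collect [B-S token, E-S token] pairs from a tagged sentence.
    pairs, start = [], None
    for token, tag in zip(tokens, tags):
        if tag == "B-S":
            start = token
        elif tag == "E-S" and start is not None:
            pairs.append([start, token])
            start = None
    return pairs

tokens = ["salle", "d'une", "surface", "de", "200_m²"]
tags = ["B-S", "I-S", "I-S", "I-S", "E-S"]
print(extract_pairs(tokens, tags))  # [['salle', '200_m²']]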