2. GRDF: Bringing natural gas everyday
to 11 million people
● Engie: international gas & electricity company
based in France, serving 70 countries with
60% renewable energy
● GRDF: subsidiary dedicated to the
distribution of natural gas with 200’000 km of
pipelines (largest in Europe!)
● Historical partnership with Dataiku since the
creation of GRDF Data Lab in 2014
○ Our very first energy client!
○ 24 business-driven projects
○ 10+ Data Scientists & 20+ business users
3. How to better manage an
infrastructure network inventory?
● Inventory management: need to regularly update the
maintenance database to reflect truth
➡ Stakes of industrial security and financial
compliance
● Budget of 1M€ per day dedicated to maintenance
● Let’s go merging (data sources)!
GMAO + QE = RIO2
● Challenge: no direct way to join the maintenance
and client databases
➡ 460’000 addresses needing manual inspection to
check for missing equipment
Maintenance
database
Client
database
Inventory
database
4. Data Science to the rescue!
● Cost of one field visit: 25 €
● Optimizing the manual process to inspect the
460’000 addresses
1. Office verification to check if it exists in
the maintenance database
2. If not, field visit (outsourced)
3. Update of the database
● Optimizing Step 1 allows to save office worker
time, reduce the number of field visits, and
avoid the creation of duplicates
YOU ONLY VISIT ONCE
Fuzzy matching to facilitate manual inspection
OR DON’T VISIT AT ALL
5. Fuzzy matching from inventory to
maintenance database
Fun stuff: making tables out of unstructured data ; SQL parallelization over ~100m rows ;
Custom fuzzy matching script to avoid doing 100m x 100m computations
Step 1 - Address matching
Inventory
Database
(MongoDB)
Maintenance
Database
(SAP ⇾ Exadata)
Data preparation Modeling
6. Distance metric on the spelling of
the “Street” field
08/11/2017
● Use of classic text mining techniques to
define a distance metric on text fields
● Chosen metric: the normalized
Levenshtein distance
● Computation for all addresses of the
two databases in a given zipcode
● Chosen threshold at 0.8 to flag potential
duplicates
Custom rules for using numbers in a
given street
● Refinement over the previously
detected duplicates
● Comparison over the street numbers
of the potential duplicates
○ Identical number: top priority
○ Same block (number +/- 10)
○ Missing data
0 1
Methodology for address matching
7. ● Initial definition by Vladimir Levenshtein in 1966
● Principle: counting the number of elementary
operations (deletion, insertion, substitution) to
get from one word to the other
● Implementation in Python using the open source
library fuzzywuzzy optimized with C
08/11/2017
kittens ⇢ kitten (deletion)
kitten ⇢ sitten (substitution)
sitten ⇢ sittin (substitution)
sittin ⇢ sitting (insertion)
A distance metric over text?
R.I.P. Prof.
Levenshtein...
8. Fuzzy matching from inventory to
maintenance database
Fun stuff: creative ways to compute metrics on industrial structure based on raw data ; finding
the right distance metric for mixed-type data
Step 2 - Industrial structure matching
Inventory
Database
Maintenance
Database
Data preparation Modeling
Duplicate
Database
9. 08/11/2017
● 6 quantitative metrics: Number of
Conduites Individuelles, Conduites
Montantes, Conduites de Coursive,
Branchements Particuliers, Nourrices, Tiges
Cuisine
● 26 qualitative metrics: Building type,
Basement existence, CI-CM accessibility,
Pressure level, PBDI/DPBE existence , CM
material, CM diameter, CI-CM Tap existence
/ type / location / brand, Box existence /
type, Deposit plate existence / type,
Regulator existence / number / pressure /
type / brand, RDBP existence DDMP
existence / type / brand / year, CM number,
Lot number
What defines the industrial structure?
10. ● Challenge: how to define a normalized
distance metric on mixed-type data with
missing values
➡ Weighted Gower distance
08/11/2017
Metric RIO2 GMAO
CM number 2 3
CI number 5 5
Nourrice number 1 n/a
BP number 0 4
Regulator type Brand 3 Brand 3
Box location OUT OF LIST BURIED
CM material Material 1 Material 3
Tap brand Brand 1 Brand 1
CM number n/a n/a
…
?
Defining a distance in non-Euclidean space
11. 08/11/2017
● Definition of two distinct distance metrics
based on specific data:
○ Global variables
○ Regulator-related variables
● Manual adjustments to the weighting scheme:
○ Global metric: 50% of weights on
regulator-related variables
○ Variable-specific weighting to penalize
columns with high rate of missing values
● Custom filters based on the metrics values
○ Identical Lot / CM number (sure duplicate)
○ Global metric ≥ 0.8
○ Regulator metric ≥ 0.8
Business-tuning the metric
Distance = 1 - Proximity
12. How I stopped
worrying and learned
to love messy data
• Stop the blame game!
• No free lunch in databases
• Data quality is the common ground
between business and science