©2017 Dataiku, Inc. | www.dataiku.com | contact@dataiku.com | @dataiku
How I stopped worrying and learned to love messy data
@alex_combessie
How Data Science can help energy
companies map their infrastructure
GRDF: Bringing natural gas every day
to 11 million people
● Engie: international gas & electricity company
based in France, serving 70 countries with
60% renewable energy
● GRDF: subsidiary dedicated to the
distribution of natural gas with 200’000 km of
pipelines (largest in Europe!)
● Long-standing partnership with Dataiku since the
creation of the GRDF Data Lab in 2014
○ Our very first energy client!
○ 24 business-driven projects
○ 10+ Data Scientists & 20+ business users
How to better manage an
infrastructure network inventory?
● Inventory management: the maintenance database must be
updated regularly so that it reflects what is actually in the field
➡ Stakes: industrial safety and financial
compliance
● Budget of 1M€ per day dedicated to maintenance
● Let’s go merging (data sources)!
GMAO + QE = RIO2
● Challenge: no direct way to join the maintenance
and client databases
➡ 460’000 addresses needing manual inspection to
check for missing equipment
Diagram: Maintenance database + Client database ⇾ Inventory database
Data Science to the rescue!
● Cost of one field visit: 25 €
● Optimizing the manual process to inspect the
460’000 addresses
1. Office verification to check if it exists in
the maintenance database
2. If not, field visit (outsourced)
3. Update of the database
● Optimizing Step 1 saves office workers' time,
reduces the number of field visits, and
avoids creating duplicates
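To make the stakes concrete, here is a minimal arithmetic sketch using the two figures from the slide (460'000 addresses, 25 € per field visit). The `resolved_in_office` fraction is a hypothetical parameter, not a number reported in the talk.

```python
# Figures quoted on the slide.
ADDRESSES_TO_INSPECT = 460_000
COST_PER_FIELD_VISIT_EUR = 25


def field_visit_cost(resolved_in_office: float) -> int:
    """Field-visit cost when a share of addresses is settled at Step 1.

    `resolved_in_office` is a hypothetical fraction between 0 and 1:
    the share of addresses found in the maintenance database during
    office verification, which therefore need no field visit.
    """
    if not 0.0 <= resolved_in_office <= 1.0:
        raise ValueError("resolved_in_office must be between 0 and 1")
    visits = round(ADDRESSES_TO_INSPECT * (1.0 - resolved_in_office))
    return visits * COST_PER_FIELD_VISIT_EUR


# Upper bound: every single address ends up needing a field visit.
worst_case_eur = field_visit_cost(0.0)
```

With no address resolved in the office, the upper bound is 460'000 × 25 € = 11.5 M€; every address matched at Step 1 removes 25 € from that total.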
YOU ONLY VISIT ONCE
Fuzzy matching to facilitate manual inspection
OR DON’T VISIT AT ALL
Fuzzy matching from inventory to
maintenance database
Fun stuff: making tables out of unstructured data ; SQL parallelization over ~100M rows ;
Custom fuzzy matching script to avoid doing 100M x 100M computations
Step 1 - Address matching
Diagram: Inventory Database (MongoDB) + Maintenance Database (SAP ⇾ Exadata) ⇾ Data preparation ⇾ Modeling
Distance metric on the spelling of
the “Street” field
08/11/2017
● Use of classic text mining techniques to
define a distance metric on text fields
● Chosen metric: the normalized
Levenshtein distance
● Computation for all addresses of the
two databases in a given zipcode
● Chosen threshold at 0.8 to flag potential
duplicates
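The four bullets above can be sketched as follows. This is an illustration under stated assumptions, not the script used in the project: the record fields (`id`, `zipcode`, `street`) are hypothetical, the distance is normalized by the longer string's length (one of several possible normalizations), and the 0.8 threshold is read as a minimum similarity. Comparing only addresses that share a zipcode is what avoids the full cross-product of the two databases.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]


def street_similarity(a: str, b: str) -> float:
    """1 - distance / max length (assumed normalization), in [0, 1]."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def candidate_pairs(inventory, maintenance, threshold=0.8):
    """Compare streets only within the same zipcode (blocking)."""
    by_zip = defaultdict(list)
    for rec in maintenance:
        by_zip[rec["zipcode"]].append(rec)
    pairs = []
    for inv in inventory:
        for mnt in by_zip.get(inv["zipcode"], []):
            score = street_similarity(inv["street"], mnt["street"])
            if score >= threshold:
                pairs.append((inv["id"], mnt["id"], score))
    return pairs


# Made-up records, for illustration only.
inventory = [{"id": "I1", "zipcode": "75011", "street": "rue de la Roquette"}]
maintenance = [
    {"id": "M1", "zipcode": "75011", "street": "rue de la Roquete"},
    {"id": "M2", "zipcode": "75011", "street": "boulevard Voltaire"},
    {"id": "M3", "zipcode": "69001", "street": "rue de la Roquette"},
]
flagged = candidate_pairs(inventory, maintenance)
```

Here only the misspelled street in the same zipcode is flagged; the identical street name in another zipcode is never compared.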
Custom rules for using numbers in a
given street
● Refinement over the previously
detected duplicates
● Comparison over the street numbers
of the potential duplicates
○ Identical number: top priority
○ Same block (number +/- 10)
○ Missing data
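A minimal sketch of these street-number rules. The category names, the ranking order beyond "identical first", and the extra "different" bucket for numbers more than 10 apart are assumptions made for illustration.

```python
from typing import Optional

# Lower rank = inspected first. Only "identical: top priority" comes from
# the slide; the rest of the ordering is an assumption.
PRIORITY_RANK = {"identical": 0, "same_block": 1, "missing": 2, "different": 3}


def number_match(inv_number: Optional[int],
                 mnt_number: Optional[int],
                 block_size: int = 10) -> str:
    """Classify a candidate pair by comparing the two street numbers."""
    if inv_number is None or mnt_number is None:
        return "missing"
    if inv_number == mnt_number:
        return "identical"
    if abs(inv_number - mnt_number) <= block_size:
        return "same_block"
    return "different"


# Hypothetical candidates (maintenance id, street number) for inventory
# number 12 on the same street.
candidates = [("M1", 40), ("M2", None), ("M3", 12), ("M4", 18)]
ranked = sorted(candidates,
                key=lambda c: PRIORITY_RANK[number_match(12, c[1])])
```

Sorting the street-level candidates this way puts the exact number first, then neighbours on the same block, then records where the number is unknown.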
Methodology for address matching
● Initial definition by Vladimir Levenshtein in 1966
● Principle: counting the minimum number of elementary
operations (deletion, insertion, substitution) needed to
get from one word to the other
● Implementation in Python using the open-source
library fuzzywuzzy, optimized with C
kittens ⇢ kitten (deletion)
kitten ⇢ sitten (substitution)
sitten ⇢ sittin (substitution)
sittin ⇢ sitting (insertion)
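For reference, a pure-Python sketch of the classic dynamic-programming computation. It is a stand-in written for illustration, not the fuzzywuzzy code path used in the project. Note that the four steps listed above illustrate the three operation types but are not the shortest route: the distance is a minimum, and three substitutions are enough to turn "kittens" into "sitting".

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of deletions, insertions and substitutions."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete all i characters
    for j in range(cols):
        dist[0][j] = j          # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,          # deletion
                dist[i][j - 1] + 1,          # insertion
                dist[i - 1][j - 1] + cost,   # substitution (or match)
            )
    return dist[-1][-1]


kitten_distance = levenshtein("kitten", "sitting")
kittens_distance = levenshtein("kittens", "sitting")
```

Both calls return 3: two substitutions and one insertion for "kitten", three substitutions for "kittens".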
A distance metric over text?
R.I.P. Prof.
Levenshtein...
Fuzzy matching from inventory to
maintenance database
Fun stuff: creative ways to compute metrics on industrial structure based on raw data ; finding
the right distance metric for mixed-type data
Step 2 - Industrial structure matching
Diagram: Inventory Database + Maintenance Database ⇾ Data preparation ⇾ Modeling ⇾ Duplicate Database
● 6 quantitative metrics: Number of
Conduites Individuelles, Conduites
Montantes, Conduites de Coursive,
Branchements Particuliers, Nourrices, Tiges
Cuisine
● 26 qualitative metrics: Building type,
Basement existence, CI-CM accessibility,
Pressure level, PBDI/DPBE existence, CM
material, CM diameter, CI-CM Tap existence
/ type / location / brand, Box existence /
type, Deposit plate existence / type,
Regulator existence / number / pressure /
type / brand, RDBP existence, DDMP
existence / type / brand / year, CM number,
Lot number
What defines the industrial structure?
● Challenge: how to define a normalized
distance metric on mixed-type data with
missing values
➡ Weighted Gower distance
Metric             RIO2           GMAO
CM number          2              3
CI number          5              5
Nourrice number    1              n/a
BP number          0              4
Regulator type     Brand 3        Brand 3
Box location       OUT OF LIST    BURIED
CM material        Material 1     Material 3
Tap brand          Brand 1        Brand 1
CM number          n/a            n/a
…
?
Defining a distance in non-Euclidean space
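A minimal sketch of a weighted Gower distance applied to the example rows of the slide. The column names and the numeric ranges used to scale quantitative differences are hypothetical (the talk does not give them); a variable is simply skipped when either side is missing.

```python
from typing import Dict, Optional, Union

Value = Optional[Union[int, float, str]]


def gower_distance(a: Dict[str, Value],
                   b: Dict[str, Value],
                   ranges: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Weighted Gower distance in [0, 1]; None if nothing is comparable.

    Columns listed in `ranges` are treated as quantitative and scaled by
    their range; all other columns are treated as qualitative (0 if
    equal, 1 otherwise). Missing values (None) are ignored.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for name, x in a.items():
        y = b.get(name)
        if x is None or y is None:
            continue
        w = weights.get(name, 1.0) if weights else 1.0
        if name in ranges:
            span = ranges[name]
            d = min(abs(x - y) / span, 1.0) if span > 0 else 0.0
        else:
            d = 0.0 if x == y else 1.0
        weighted_sum += w * d
        total_weight += w
    return None if total_weight == 0 else weighted_sum / total_weight


# Values from the slide; None stands for "n/a".
rio2 = {"cm_number": 2, "ci_number": 5, "nourrice_number": 1, "bp_number": 0,
        "regulator_type": "Brand 3", "box_location": "OUT OF LIST",
        "cm_material": "Material 1", "tap_brand": "Brand 1"}
gmao = {"cm_number": 3, "ci_number": 5, "nourrice_number": None, "bp_number": 4,
        "regulator_type": "Brand 3", "box_location": "BURIED",
        "cm_material": "Material 3", "tap_brand": "Brand 1"}

# Hypothetical ranges: assumed here to be 10 for every count column.
ranges = {"cm_number": 10, "ci_number": 10, "nourrice_number": 10, "bp_number": 10}

distance = gower_distance(rio2, gmao, ranges)
```

Under these assumed ranges and equal weights, seven variables are comparable and the distance is 2.5 / 7, roughly 0.36.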
● Definition of two distinct distance metrics
based on specific data:
○ Global variables
○ Regulator-related variables
● Manual adjustments to the weighting scheme:
○ Global metric: 50% of weights on
regulator-related variables
○ Variable-specific weighting to penalize
columns with high rate of missing values
● Custom filters based on the metrics values
○ Identical Lot / CM number (sure duplicate)
○ Global metric ≥ 0.8
○ Regulator metric ≥ 0.8
Business-tuning the metric
Distance = 1 - Proximity
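The weighting scheme and filters could be sketched as below. Several choices here are assumptions made for illustration: the penalty takes the form (1 − missing rate), the 0.8 thresholds apply to proximity (1 − distance), and the two thresholds are combined with a logical AND. The column names and missing rates are made up.

```python
from typing import Dict, Set


def build_weights(missing_rate: Dict[str, float],
                  regulator_columns: Set[str],
                  regulator_share: float = 0.5) -> Dict[str, float]:
    """Give regulator columns a fixed share of the total weight and
    penalize columns with many missing values (assumed: 1 - rate)."""
    raw = {col: 1.0 - rate for col, rate in missing_rate.items()}
    reg_total = sum(w for c, w in raw.items() if c in regulator_columns)
    other_total = sum(w for c, w in raw.items() if c not in regulator_columns)
    weights = {}
    for col, w in raw.items():
        if col in regulator_columns:
            weights[col] = regulator_share * w / reg_total if reg_total else 0.0
        else:
            share = 1.0 - regulator_share
            weights[col] = share * w / other_total if other_total else 0.0
    return weights


def classify_pair(same_lot_and_cm_number: bool,
                  global_proximity: float,
                  regulator_proximity: float,
                  threshold: float = 0.8) -> str:
    """Apply the custom filters to one candidate pair."""
    if same_lot_and_cm_number:
        return "sure duplicate"
    if global_proximity >= threshold and regulator_proximity >= threshold:
        return "potential duplicate"
    return "no match"


# Hypothetical missing rates per column.
missing_rate = {"regulator_type": 0.1, "regulator_brand": 0.5,
                "building_type": 0.0, "cm_material": 0.2}
weights = build_weights(missing_rate, {"regulator_type", "regulator_brand"})
```

The resulting weights sum to 1, half of them sit on the regulator columns, and within each group a column with more missing values weighs less.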
How I stopped
worrying and learned
to love messy data
• Stop the blame game!
• No free lunch in databases
• Data quality is the common ground
between business and science
Just Start. Start Now.
Fail often. Enjoy the ride.
Seth Godin
