From Text Mining
to Code Mining
3. WiML&DS Paris,
Futurs, 25th
January, 2018
Juliette Tisseyre
Software engineer at Margo
www.margoconseil.com
Margo & CodeCase
Juliette Tisseyre
EPITA, specialisation in cognitive science
Software R&D engineer for CodeCase team, London
juliette.tisseyre@margoconseil.com @zanoellia
IT Consulting company @Margoconseil
300 consultants, revenue: 26 M€, Paris, London, Poland
Shortlisted in the “Champions de la croissance” ranking, Les Echos, Feb. 2017
We simplify IT
We manage safe, high-quality and cost-effective code modernisation projects
Migration & Refactoring: 70% automation ratio.
Agenda
● Introduction
● Code Mining unveiling
● Text Mining approach
● Solutions to limitations
● Conclusion
Introduction
Everybody needs to ACCESS knowledge to learn, explain, control, decide, monetise…
But knowledge is not only described in natural languages. You can EXTRACT knowledge from a less conventional text: the CODE
Code Mining unveiling
Code Mining definition
➙ Extract knowledge from source code
3,000 billion lines of code running in the world
Similar to Text Mining in terminology, steps, issues and applications.
Text mining ➙ have a machine understand text
Code mining ➙ have a human being understand code
Source code: structural parallel
Text        Code
Document    Document
Chapter     Class
Section     Method
Paragraph   Block
Sentence    Instruction
Word        (Key)word
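As a rough illustration of this parallel, the sketch below uses Python's standard `ast` module to walk a small module and list its class, method and instruction levels. The sample source and the `outline` helper are hypothetical, introduced only for this example; it is one possible way to expose the hierarchy, not the method used by the speaker.

```python
import ast

# Hypothetical sample, invented for this illustration.
SAMPLE = '''
class Account:
    def deposit(self, amount):
        self.balance += amount
        return self.balance
'''

def outline(source):
    """Return (level, name) pairs: document > class > method > instruction."""
    tree = ast.parse(source)
    rows = [("document", "module")]
    for cls in tree.body:
        if isinstance(cls, ast.ClassDef):
            rows.append(("class", cls.name))
            for item in cls.body:
                if isinstance(item, ast.FunctionDef):
                    rows.append(("method", item.name))
                    # Each statement plays the role of a "sentence".
                    for stmt in item.body:
                        rows.append(("instruction", type(stmt).__name__))
    return rows

for level, name in outline(SAMPLE):
    print(f"{level:12} {name}")
```

Only top-level classes and their direct methods are listed here; nested definitions would need a recursive walk.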
Source code: duality
[Figure: the same code as viewed by a programmer and as viewed by a machine]
Global process
Before applying smart algorithms, the text / code must be transformed into a model (features)
code ➙ Reverse engine ➙ model
● Business logic extraction, classification
● Automated migration / translation
● Search and indexing
● Detection of (anti-)patterns or similarity
● Summary, algorithm visualisation
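A minimal sketch of the "code ➙ model" step, assuming the simplest possible model: a bag of tokens. It relies on Python's standard `tokenize` module; the `to_features` name and the sample input are assumptions made for this example, and a real pipeline would likely use richer features.

```python
import io
import tokenize
from collections import Counter

def to_features(source):
    """Turn Python source into a bag-of-tokens Counter (a very simple model)."""
    # Layout tokens and comments carry no feature value in this sketch.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
            tokenize.ENDMARKER, tokenize.COMMENT}
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in skip:
            counts[tok.string] += 1
    return counts

features = to_features("total = price * qty\ntotal = total + tax\n")
print(features.most_common(3))
```

Such a count vector can then feed standard text mining algorithms (search, similarity, classification).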
Text Mining approach
Natural approach
● Treat code as simple text
● Extract natural language elements
  ● Names of code entities (variables, functions…)
  ● Comments, string content
● Reuse of Text Mining techniques
● Similar challenges
  ● Infinite vocabulary
  ● Strong noise
  ● Not always understandable for a human
  ● Mix of languages can occur
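The natural approach can be sketched as follows: pull identifiers, comments and string content out of the code, then split them into plain words. This is an illustrative sketch only; `split_identifier`, `natural_words` and the sample function are hypothetical names, and the splitting rules are a simplifying assumption.

```python
import io
import keyword
import re
import tokenize

def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]

def natural_words(source):
    """Collect words from identifiers, comments and string literals."""
    words = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            words += split_identifier(tok.string)
        elif tok.type in (tokenize.COMMENT, tokenize.STRING):
            words += re.findall(r"[a-z]+", tok.string.lower())
    return words

# Hypothetical sample, invented for this illustration.
SAMPLE = '''
def loadInvoice(file_path):
    # read the customer invoice
    if not file_path:
        raise ValueError("missing invoice path")
    return open(file_path)
'''

print(natural_words(SAMPLE))
```

The resulting word list can be handed to ordinary text mining tools, which is also where the noise and vocabulary problems listed above start to show.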
Data cleaning
● What is relevant or not in the code?
  ● Generated code
  ● Technical frameworks
  ● Comments and names
  ● Useful code vs meaningless code
➙ Not a trivial task, depends on the objective
● Balance between cleaning and information loss
● Code structure and coding conventions can help make the choice
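One cleaning step that coding conventions make possible is dropping generated files. The sketch below assumes that generated files announce themselves with a marker in their first lines; the marker list, the `looks_generated` and `clean_corpus` helpers and the sample corpus are all assumptions made for this example, and the right rules depend on the objective.

```python
# Assumed markers; real projects may use others.
GENERATED_MARKERS = ("do not edit", "@generated", "auto-generated", "autogenerated")

def looks_generated(source, head_lines=5):
    """Heuristic: a generated-code marker appears in the first few lines."""
    head = "\n".join(source.splitlines()[:head_lines]).lower()
    return any(marker in head for marker in GENERATED_MARKERS)

def clean_corpus(files):
    """Keep only files that do not look generated. `files` maps path -> source."""
    return {path: src for path, src in files.items() if not looks_generated(src)}

# Hypothetical corpus, invented for this illustration.
corpus = {
    "invoice.py": "def total(lines):\n    return sum(lines)\n",
    "invoice_pb2.py": "# Generated by the protocol buffer compiler.  DO NOT EDIT!\nx = 1\n",
}
print(sorted(clean_corpus(corpus)))
```

A stricter filter removes more noise but risks discarding useful code, which is the balance mentioned above.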
Data cleaning: example
Same business logic “open a file” but
● Two different languages
● Different verbosity level
What do we need to keep?
[Figure: the same “open a file” logic written in Java and in Python]
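To make the question concrete, here is a small sketch comparing the vocabulary of two snippets that open and read a file. The Java and Python snippets are hypothetical stand-ins, not the ones shown on the slide, and the `words` helper is an assumption for this example.

```python
import re

# Hypothetical snippets, held as plain strings and never executed.
JAVA = '''
BufferedReader reader = new BufferedReader(new FileReader("report.txt"));
String line = reader.readLine();
reader.close();
'''

PYTHON = '''
with open("report.txt") as reader:
    line = reader.readline()
'''

def words(source):
    """Lowercased words from the source, splitting camelCase first."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", source)
    return set(re.findall(r"[a-z]+", spaced.lower()))

shared = words(JAVA) & words(PYTHON)
java_only = words(JAVA) - words(PYTHON)
print("shared:", sorted(shared))
print("Java only:", sorted(java_only))
```

The shared words are candidates for what to keep; the language-specific remainder is mostly verbosity, although a naive split can still separate `readLine` from `readline`.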
Natural approach: assessment
Good starting point but...
● Unable to solve all ambiguities
  ● Example: mathematical Log function versus logging module Log
● Construction of datasets for training is tricky
  ● Human subjectivity
  ● Open source vs corporate code
● Mixed results
  ● Very poor results for code transformation
  ● Too dependent on the code’s quality
Solutions to limitations
Formal approach
● Treat code as a structure, no interest in naming and comments
● Based on programming language grammars: sets of well-defined and unambiguous lexical, syntactic and semantic rules
● Modelling as an AST (abstract syntax tree) or graph
Formal approach: example
[Figure: example code transformed into its formal model]
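A minimal sketch of the formal approach, assuming Python as the analysed language and its standard `ast` module as the grammar-based parser. The `shape` helper is hypothetical; it keeps only node types, so naming plays no role in the resulting structure.

```python
import ast

def shape(node):
    """Nested (type, children) view of an AST, ignoring names and values."""
    return (type(node).__name__,
            [shape(child) for child in ast.iter_child_nodes(node)])

tree = ast.parse("total = price * quantity")
print(ast.dump(tree))

# Two statements with different names share the same structure...
same = shape(tree) == shape(ast.parse("a = b * c"))
# ...while a different operator changes it.
different = shape(tree) != shape(ast.parse("a = b + c"))
print(same, different)
```

The exact text printed by `ast.dump` varies between Python versions, which is why the comparison is done on node types only.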
Formal approach: assessment
Another powerful level of analysis:
● Only a few ambiguities thanks to internal relationship knowledge
● Acceptable results for code modernisation
● Existing tools and algorithms for graph analysis
➙ Tools using the formal approach on code already exist
But tough limitations:
● Unable to understand the meaning
● Poor results on business logic extraction
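The "internal relationship knowledge" point can be illustrated with the Log ambiguity mentioned in the natural approach assessment. The sketch below, under the assumption that imports are written as `from module import name`, resolves each called name to the module it comes from. `call_origins` is a hypothetical helper, and both samples are parsed only, never executed.

```python
import ast

def call_origins(source):
    """Map each called name to the module it was imported from, when known."""
    tree = ast.parse(source)
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.module
    origins = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            origins[node.func.id] = imported.get(node.func.id, "unknown")
    return origins

MATH_CODE = "from math import log\nvalue = log(10)\n"
LOGGING_CODE = "from logging import log\nlog(20, 'started')\n"

print(call_origins(MATH_CODE))
print(call_origins(LOGGING_CODE))
```

Seen as plain text, both samples contain the same word `log`; seen through the structure, the two calls point to different modules. What the structure still does not say is what the computation means for the business.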
Hybrid approach
➙ Mix the natural and formal approaches
Bottom-up process:
● Rely on the code structure
● Text mining techniques to consolidate meaning
Early stage... to be challenged!
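A possible reading of this bottom-up process, sketched under stated assumptions: the AST provides the units (here, functions), and word extraction over names and docstrings attaches meaning to each unit. The `words` and `describe_functions` helpers and the sample are hypothetical, and this is not presented as the CodeCase implementation.

```python
import ast
import re
from collections import Counter

def words(text):
    """Lowercased words, splitting camelCase and snake_case."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", spaced.lower())

def describe_functions(source):
    """For each function in the AST, count words from its names and docstring."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            bag = Counter(words(node.name))
            bag.update(words(ast.get_docstring(node) or ""))
            for child in ast.walk(node):
                if isinstance(child, ast.Name):
                    bag.update(words(child.id))
                elif isinstance(child, ast.arg):
                    bag.update(words(child.arg))
            result[node.name] = bag
    return result

# Hypothetical sample, invented for this illustration.
SAMPLE = '''
def computeInvoiceTotal(invoice_lines, tax_rate):
    """Sum the invoice lines and apply tax."""
    subtotal = sum(invoice_lines)
    return subtotal * (1 + tax_rate)
'''

for name, bag in describe_functions(SAMPLE).items():
    print(name, bag.most_common(3))
```

The structure decides where the words belong; the words suggest what each unit is about.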
Conclusion
● Domain with growing needs and infinite applications
● Analysis performed at natural or formal level but rarely at both
● Lack of specific algorithms and techniques
● Low automation rate, human intervention
● No mature techniques
...amazing lands yet to be explored!
Questions?
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
Martin Fowler
Juliette Tisseyre, Margo - CodeCase, London
juliette.tisseyre@margoconseil.com / @zanoellia
