From Text Mining to Code Mining by Juliette Tisseyre

From Text Mining to Code Mining
3. WiML&DS Paris, Futurs, 25th January, 2018
Juliette Tisseyre
Software engineer at Margo
www.margoconseil.com

Margo & CodeCase
Juliette Tisseyre
EPITA, specialisation in cognitive science
Software R&D engineer for CodeCase team, London
juliette.tisseyre@margoconseil.com @zanoellia
IT consulting company @Margoconseil
300 consultants, revenue: €26M, Paris, London, Poland
Shortlisted in the Palmarès “Champions de la croissance”, Les Echos, Feb. 2017
We simplify IT
We manage safe, high-quality and cost-effective code modernisation projects
Migration & refactoring, with a 70% automation ratio.

Agenda
● Introduction
● Code Mining unveiling
● Text Mining approach
● Solutions to limitations
● Conclusion

Introduction
Everybody needs to ACCESS knowledge to learn, explain, control, decide, monetise…
But knowledge is not only described in natural languages. You can EXTRACT knowledge from a less conventional text: the CODE

Code Mining definition
➙ Extract knowledge from source code
3,000 billion lines of code running in the world
Similar to Text Mining: terminology, steps, issues, applications.
Text mining ➙ have a machine understand text
Code mining ➙ have a human being understand code

Source code: structure parallel
Text        Code
Document    Document
Chapter     Class
Section     Method
Paragraph   Block
Sentence    Instruction
Word        (Key)word

Source code: duality
As viewed by a programmer
As viewed by a machine

Global process
Before applying smart algorithms, the text / code must be transformed into a model (features)
code ➙ reverse engine ➙ model
● Business logic extraction, classification
● Automated migration / translation
● Search and indexing
● Detection of (anti) pattern or similarity
● Summary, algorithm visualisation

Natural approach
● Treat code as simple text
● Extract natural language elements
● Names of code entities (variables, functions…)
● Comments, string content
● Reuse of Text Mining techniques
● Similar challenges
● Infinite vocabulary
● Strong noise
● Not always understandable for a human
● Mix of languages can occur
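A minimal sketch of the extraction step listed above, assuming the code being mined is Python and using only the standard `tokenize` module. The helper names `split_identifier` and `natural_elements` and the sample snippet are hypothetical, chosen for illustration; they are not taken from the talk.

```python
import io
import keyword
import re
import tokenize


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def natural_elements(source):
    """Collect identifier words, comments and string contents from Python source."""
    found = {"names": [], "comments": [], "strings": []}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            found["names"] += split_identifier(tok.string)
        elif tok.type == tokenize.COMMENT:
            found["comments"].append(tok.string.lstrip("# ").strip())
        elif tok.type == tokenize.STRING:
            # Plain string literals only; prefixes and f-strings are not handled here.
            found["strings"].append(tok.string.strip("\"'"))
    return found


# Hypothetical sample, for illustration only.
SAMPLE = '''
# Compute the invoice total
def computeInvoiceTotal(net_amount, tax_rate):
    label = "Invoice total"
    return net_amount * (1 + tax_rate)
'''

elements = natural_elements(SAMPLE)
print(elements["names"][:3])   # ['compute', 'invoice', 'total']
print(elements["comments"])    # ['Compute the invoice total']
print(elements["strings"])     # ['Invoice total']
```

The resulting word lists can then be fed to ordinary text mining techniques, which is where the challenges listed above (noise, open vocabulary, mixed languages) show up.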

Data cleaning
● What is relevant or not in the code?
● Generated code
● Technical frameworks
● Comments and names
● Useful code vs meaningless code
➙ Not a trivial task, depends on the objective
● Balance between cleaning and information loss
● Code structure and coding conventions can help make the choice
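As one possible illustration of the cleaning step, the sketch below drops files that look generated before any mining takes place. The marker list and the names `is_generated` and `clean_corpus` are assumptions made for this example; as the slide says, what counts as relevant depends on the objective, so a real project would need its own rules.

```python
import re

# Hypothetical markers; real code bases need their own list.
GENERATED_MARKERS = re.compile(r"@generated|do not edit|auto-?generated", re.IGNORECASE)


def is_generated(source, header_lines=5):
    """Return True if a generated-code marker appears in the first few lines."""
    header = "\n".join(source.splitlines()[:header_lines])
    return bool(GENERATED_MARKERS.search(header))


def clean_corpus(files):
    """Keep only the files that do not look generated."""
    return {name: src for name, src in files.items() if not is_generated(src)}


# Hypothetical two-file corpus, for illustration only.
corpus = {
    "stubs.py": "# AUTO-GENERATED by a tool, DO NOT EDIT\nx = 1\n",
    "billing.py": "def total(a, b):\n    return a + b\n",
}
print(list(clean_corpus(corpus)))  # ['billing.py']
```

Checking only the file header is a deliberate trade-off: it limits information loss from false positives, at the cost of missing generated code that carries no marker.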

Data cleaning: example
Same business logic “open a file” but
● Two different languages
● Different verbosity level
What do we need to keep?
Code samples: Java, Python

Natural approach: assessment
Good starting point but...
● Unable to solve all ambiguities
● Example: mathematical Log function versus logging module Log
● Construction of datasets for training is tricky
● Human subjectivity
● Open source vs corporate code
● Various results
● Very poor results for code transformation
● Too dependent on the code’s quality
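The Log ambiguity mentioned above can be reproduced in a few lines. This is a sketch under the assumption that code is reduced to a plain bag of words; the two one-line snippets and the `bag_of_words` helper are hypothetical examples, not material from the talk.

```python
import re
from collections import Counter


def bag_of_words(source):
    """Lowercased word counts, ignoring all code structure."""
    return Counter(w.lower() for w in re.findall(r"[A-Za-z]+", source))


# Two unrelated uses of "log", written as source text.
MATH_CODE = "y = math.log(x)"
LOGGING_CODE = "logging.log(level, message)"

shared = set(bag_of_words(MATH_CODE)) & set(bag_of_words(LOGGING_CODE))
print(shared)  # {'log'}
```

At this level the logarithm and the logging call share the same word, so nothing in the model distinguishes them.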

Formal approach
● Treat code as a structure, no interest in naming and comments
● Based on programming language grammars: set of well-defined and unambiguous lexical, syntactic and semantic rules
● Modelling as an AST or graph

Formal approach: example
transformed into
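The original figure is not available in this transcript. As a stand-in, here is a minimal sketch of what transforming code into an AST can look like, assuming Python source and the standard `ast` module. The `outline` helper and the one-line sample are hypothetical.

```python
import ast


def outline(node, depth=0):
    """Return an indented, line-per-node view of an AST."""
    label = type(node).__name__
    if isinstance(node, ast.Name):
        label += f"({node.id})"
    elif isinstance(node, ast.BinOp):
        label += f"[{type(node.op).__name__}]"
    lines = ["  " * depth + label]
    for child in ast.iter_child_nodes(node):
        # Skip Load/Store contexts and operator nodes to keep the view short.
        if isinstance(child, (ast.expr_context, ast.operator)):
            continue
        lines += outline(child, depth + 1)
    return lines


SOURCE = "total = price * quantity"
print("\n".join(outline(ast.parse(SOURCE))))
# Module
#   Assign
#     Name(total)
#     BinOp[Mult]
#       Name(price)
#       Name(quantity)
```

The tree keeps the relationships between elements (what is assigned, what is multiplied) and would look the same if the variables were renamed.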

Formal approach: assessment
Another powerful level of analysis:
● Only a few ambiguities, thanks to knowledge of internal relationships
● Acceptable results for code modernisation
● Existing tools and algorithms for graph analysis
➙ Tools using the formal approach on code already exist
But tough limitations:
● Unable to understand the meaning
● Poor results on business logic extraction
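To illustrate how structural knowledge removes some ambiguities, the sketch below reads the dotted name of each call from the AST, which separates a logarithm from a logging call. It assumes Python 3.9+ for `ast.unparse`; the `qualified_calls` helper is hypothetical, and it does not follow import aliases (for example `import math as m`), which a fuller analysis would need.

```python
import ast


def qualified_calls(source):
    """Return the dotted name of every called function, e.g. 'math.log'."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            calls.append(ast.unparse(node.func))
    return calls


print(qualified_calls("y = math.log(x)"))              # ['math.log']
print(qualified_calls("logging.log(level, message)"))  # ['logging.log']
```

The structure tells the two calls apart, but it still says nothing about what either one means for the business, which is the limitation noted above.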

Hybrid approach
➙ Mix the natural and formal approaches
Bottom-up process:
● Rely on the code structure
● Text mining techniques to consolidate meaning
Early stage... to be challenged!
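One way to picture the bottom-up process is sketched below: the AST delimits each function (structure), then the identifiers found inside it are split into words and counted (text mining). This is an illustrative assumption about how the two levels could be combined, not the method used by the CodeCase team; `words`, `terms_per_function` and the sample are hypothetical.

```python
import ast
import re
from collections import Counter


def words(name):
    """Split snake_case / camelCase identifiers into lowercase words."""
    return [w.lower() for w in re.findall(r"[A-Za-z][a-z]*", name)]


def terms_per_function(source):
    """Map each function name to word counts of the identifiers it uses."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            counts = Counter(words(node.name))
            for inner in ast.walk(node):
                if isinstance(inner, ast.Name):
                    counts.update(words(inner.id))
                elif isinstance(inner, ast.arg):
                    counts.update(words(inner.arg))
            result[node.name] = counts
    return result


# Hypothetical sample, for illustration only.
SAMPLE = '''
def load_invoice(invoice_path):
    invoice_file = open(invoice_path)
    return invoice_file.read()
'''

print(terms_per_function(SAMPLE)["load_invoice"].most_common(1))
# [('invoice', 5)]
```

The dominant term per function gives a first hint of the business concept it handles, while the function boundaries come from the grammar rather than from the text.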

Conclusion
● Domain with growing needs and infinite applications
● Analysis performed at natural or formal level but rarely at both
● Lack of specific algorithms and techniques
● Low automation rate, human intervention
● No mature techniques
...amazing lands yet to be explored!

Questions?
“Any fool can write code that a computer can understand.
Good programmers write code that humans can understand.”
Martin Fowler
Juliette Tisseyre, Margo - CodeCase, London
juliette.tisseyre@margoconseil.com / @zanoellia
