The field of text analysis provides tools and techniques for extracting information from documents. What if we applied them to a less conventional kind of text: source code? Starting from what we know about Text Mining, we will see what can be reused in Code Mining and how the characteristics of programming languages change the way we automatically process source code.
From Text Mining to Code Mining by Juliette Tisseyre
1. From Text Mining to Code Mining
WiML&DS Paris, Futurs, 25th January 2018
Juliette Tisseyre
Software engineer at Margo
www.margoconseil.com
2. Margo & CodeCase
Juliette Tisseyre
EPITA, specialisation in cognitive science
Software R&D engineer for CodeCase team, London
juliette.tisseyre@margoconseil.com @zanoellia
IT Consulting company @Margoconseil
300 consultants, revenue: 26 M€, Paris, London, Poland
Shortlisted in the Palmarès “Champions de la croissance”, Les Echos, Feb. 2017
We simplify IT
We manage Safe, Qualitative and Cost Effective code modernisation projects
Migration & Refactoring - 70% automation ratio.
3. Agenda
● Introduction
● Code Mining unveiling
● Text Mining approach
● Solutions to limitations
● Conclusion
4. Introduction
Everybody needs to ACCESS the knowledge to learn, explain, control, decide, monetise…
But the knowledge is not only described in natural languages. You can EXTRACT the knowledge from a less conventional text: the CODE
6. Code Mining definition
➙ Extract knowledge from source code
3,000 billion lines of code running in the world
Similar to Text Mining in its terminology, steps, issues and applications.
Text mining ➙ have a machine understand text
Code mining ➙ have a human being understand code
7. Source code: structure parallel
Text ➙ Code
Document ➙ Document
Chapter ➙ Class
Section ➙ Method
Paragraph ➙ Block
Sentence ➙ Instruction
Word ➙ (Key)word
8. Source code: duality
As viewed by a programmer
As viewed by a machine
9. Global process
Before applying smart algorithms, the text / code must be transformed into a model (features)
Diagram: code ➙ reverse engine ➙ model
● Business logic extraction, classification
● Automated migration / translation
● Search and indexing
● Detection of (anti-)patterns or similarities
● Summary, algorithm visualisation
11. Natural approach
● Treat code as simple text
● Extract natural language elements
● Names of code entities (variables, functions…)
● Comments, string content
● Reuse of Text Mining techniques
● Similar challenges
● Infinite vocabulary
● Strong noise
● Not always understandable for a human
● Mix of languages can occur
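The extraction of natural language elements listed for the natural approach can be sketched in Python with the standard `tokenize` module. This is only an illustration, not the tooling used by the speaker: the helper names `split_identifier` and `natural_language_elements`, and the sample function, are hypothetical.

```python
import io
import keyword
import re
import tokenize


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def natural_language_elements(source):
    """Collect words from names, plus raw comments and string literals."""
    elements = {"names": [], "comments": [], "strings": []}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            elements["names"].extend(split_identifier(tok.string))
        elif tok.type == tokenize.COMMENT:
            elements["comments"].append(tok.string.lstrip("# ").strip())
        elif tok.type == tokenize.STRING:
            # Simplification: only the surrounding quotes are stripped.
            elements["strings"].append(tok.string.strip("'\""))
    return elements


# Hypothetical sample, written for this illustration only.
sample = '''
def computeInvoiceTotal(order_lines):
    # Sum the price of every line
    total = 0
    for line in order_lines:
        total += line.unit_price * line.quantity
    print("Invoice total computed")
    return total
'''

result = natural_language_elements(sample)
print(result["names"])
print(result["comments"])
print(result["strings"])
```

The resulting word lists can then be fed to ordinary Text Mining techniques, with the challenges noted on the slide (open vocabulary, noise) still applying.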
12. Data cleaning
● What is relevant or not in the code?
● Generated code
● Technical frameworks
● Comments and names
● Useful code vs meaningless code
➙ Not a trivial task, depends on the objective
● Balance between cleaning and information loss
● Code structure and coding conventions can help make the choice
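One cleaning step mentioned above, setting generated code aside, could look like the following sketch. The marker strings and the helper names `looks_generated` and `clean_corpus` are assumptions chosen for illustration; which markers are relevant depends on the code base and on the objective of the analysis.

```python
# Assumed heuristics: generated files often announce themselves in a header.
GENERATED_MARKERS = ("@generated", "do not edit", "auto-generated", "autogenerated")


def looks_generated(source, header_lines=5):
    """Return True if the first lines of a file contain a generation marker."""
    header = "\n".join(source.splitlines()[:header_lines]).lower()
    return any(marker in header for marker in GENERATED_MARKERS)


def clean_corpus(files, keep_generated=False):
    """Keep only the files considered relevant for the analysis."""
    return {
        path: source
        for path, source in files.items()
        if keep_generated or not looks_generated(source)
    }


# Hypothetical two-file corpus: one hand-written file, one generated file.
corpus = {
    "billing/invoice.py": "def compute_total(lines):\n    return sum(l.price for l in lines)\n",
    "billing/invoice_pb2.py": "# Generated by the protocol buffer compiler.  DO NOT EDIT!\nDESCRIPTOR = None\n",
}

cleaned = clean_corpus(corpus)
print(sorted(cleaned))
```

The `keep_generated` flag reflects the point made on the slide: what counts as noise depends on the objective, so the filter should stay optional.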
13. Data cleaning: example
Same business logic “open a file” but
● Two different languages
● Different verbosity level
What do we need to keep?
Code samples: Java, Python
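The code samples shown on the original slide are not part of this transcript. As a stand-in, the sketch below compares two hypothetical snippets that both read a line from a file, one in Java (held as a plain string) and one in Python, and looks at which words survive in both once identifiers are split and lowercased.

```python
import re

# Hypothetical snippets, not the ones from the slide.
JAVA_SNIPPET = '''
BufferedReader reader = new BufferedReader(new FileReader("report.txt"));
String line = reader.readLine();
reader.close();
'''

PYTHON_SNIPPET = '''
with open("report.txt") as reader:
    line = reader.readline()
'''


def words(code):
    """Lowercase words from a snippet, splitting camelCase identifiers."""
    found = []
    for token in re.findall(r"[A-Za-z]+", code):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)
        found.extend(part.lower() for part in parts)
    return found


java_words = words(JAVA_SNIPPET)
python_words = words(PYTHON_SNIPPET)
shared = set(java_words) & set(python_words)

print(len(java_words), len(python_words))
print(sorted(shared))
```

With these two snippets the Java version yields twice as many words as the Python one, and the shared vocabulary is small. That is one way to see why deciding what to keep is not trivial.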
14. Natural approach: assessment
Good starting point but...
● Unable to solve all ambiguities
● Example: mathematical Log function versus logging module Log
● Construction of datasets for training is tricky
● Human subjectivity
● Open source vs corporate code
● Various results
● Very poor results for code transformation
● Too dependent on the code’s quality
16. Formal approach
● Treat code as a structure, no interest in naming and comments
● Based on programming language grammars: set of well-defined and unambiguous lexical, syntactic and semantic rules
● Modelling as an AST (abstract syntax tree) or graph
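As a small illustration of modelling code as an AST, the sketch below uses Python's standard `ast` module to reduce a statement to its node types only. The helper name `outline` and the sample statements are hypothetical; the point is that two statements with different names share the same structure.

```python
import ast


def outline(node, depth=0):
    """Return the tree as indented node-type names, ignoring identifiers."""
    lines = ["  " * depth + type(node).__name__]
    for child in ast.iter_child_nodes(node):
        lines.extend(outline(child, depth + 1))
    return lines


first = outline(ast.parse("total = price * quantity + tax"))
second = outline(ast.parse("a = b * c + d"))

print("\n".join(first))
print(first == second)
```

Because the outline drops every name, it captures exactly what the formal approach cares about: the grammar-level shape of the code.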
18. Formal approach: assessment
Another powerful level of analysis:
● Only a few ambiguities thanks to knowledge of internal relationships
● Acceptable results for code modernisation
● Existing tools and algorithms for graph analysis
➙ Tools using the formal approach on code already exist
But tough limitations:
● Unable to understand the meaning
● Poor results on business logic extraction
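The ambiguity cited for the natural approach (a mathematical Log function versus a logging Log) is the kind of case that structural knowledge can settle. The sketch below is one possible way to do it in Python with the `ast` module, under the assumption that the calls are written as `<module>.log(...)` on directly imported modules; `resolve_log_calls` and the sample source are hypothetical.

```python
import ast

# Hypothetical sample: the word "log" appears with two unrelated meanings.
SOURCE = '''
import math
import logging

value = math.log(10)
logging.log(logging.INFO, "value computed")
'''


def resolve_log_calls(source):
    """Map each `<module>.log(...)` call to the imported module it uses."""
    tree = ast.parse(source)
    imported = {
        alias.asname or alias.name
        for node in ast.walk(tree)
        if isinstance(node, ast.Import)
        for alias in node.names
    }
    calls = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "log"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in imported
        ):
            calls.append((node.lineno, node.func.value.id))
    return sorted(calls)


resolved = resolve_log_calls(SOURCE)
print(resolved)
```

A bag-of-words view sees the token `log` twice; the tree tells the two calls apart through the module each one is attached to. It still says nothing about what the computed value means for the business, which matches the limitation noted above.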
19. Hybrid approach
➙ Mix the natural and formal approaches
Bottom-up process:
● Rely on the code structure
● Text mining techniques to consolidate meaning
Early stage... to be challenged!
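A very rough sketch of the bottom-up idea, assuming Python source and the standard `ast` module: the structure decides the unit of analysis (here, one function), and word counts taken from names and docstrings hint at its meaning. The function `describe_functions` and the sample code are hypothetical, and this is not the speaker's implementation.

```python
import ast
import re
from collections import Counter

# Hypothetical sample code to analyse.
SOURCE = '''
def computeInvoiceTotal(order_lines):
    """Compute the total amount of an invoice."""
    return sum(line.price for line in order_lines)

def send_invoice_email(customer, invoice):
    # notify the customer
    return mailer.send(customer.email, invoice)
'''


def split_words(text):
    """Lowercase words, splitting camelCase and ignoring underscores."""
    return [w.lower() for w in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", text)]


def describe_functions(source):
    """Structure first (one entry per function), then words per function."""
    summary = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            words = Counter(split_words(node.name))
            words.update(split_words(ast.get_docstring(node) or ""))
            for child in ast.walk(node):
                if isinstance(child, ast.Name):
                    words.update(split_words(child.id))
                elif isinstance(child, ast.Attribute):
                    words.update(split_words(child.attr))
            summary[node.name] = words
    return summary


summary = describe_functions(SOURCE)
for name, words in summary.items():
    print(name, words.most_common(3))
```

Comments are not part of the Python AST, so this sketch ignores them; a fuller version would need to bring them back from the token stream.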
20. Conclusion
● Domain with growing needs and infinite applications
● Analysis performed at natural or formal level but rarely at both
● Lack of specific algorithms and techniques
● Low automation rate, human intervention
● No mature techniques
...amazing lands yet to be explored!
21. Questions?
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
Martin Fowler
Juliette Tisseyre, Margo - CodeCase, London
juliette.tisseyre@margoconseil.com / @zanoellia