Parsing by Example

Workshop at TU Delft on Dec 13, 2011

Published in: Technology
  • I always thought that we could try using some semantic knowledge of the structure of most programming languages.
    Humans can more or less parse even unknown programming languages because they know what to look for (classes, methods, functions, variables, expressions, ...).
    Maybe we could put that kind of knowledge into an 'AI-parser' ...

    Nicolas Anquetil -- INRIA/RMod Team

Parsing by Example

  1. Parsing by Example. Oscar Nierstrasz, Software Composition Group, scg.unibe.ch. Monday, 12 December 11
  5. Moose is a platform for software and data analysis. www.moosetechnology.org
  6. Model repository. (The Story of Moose, ESEC/FSE 2005)
  7. Model repository: Navigation, Metrics, Querying, Grouping. (The Story of Moose, ESEC/FSE 2005)
  8. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. (The Story of Moose, ESEC/FSE 2005)
  9. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. (The Story of Moose, ESEC/FSE 2005)
  10. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Extensible meta model. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. (The Story of Moose, ESEC/FSE 2005)
  11. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Languages: Smalltalk, Java, COBOL, C++, … Extensible meta model. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. (The Story of Moose, ESEC/FSE 2005)
  12. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Languages: Smalltalk, Java, COBOL, C++, … External Parser. CDIF, XMI. Extensible meta model. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. (The Story of Moose, ESEC/FSE 2005)
  13. System complexity
  14. Clone evolution
  15. Class blueprint
  16. Topic correlation matrix
  17. Distribution map (topics spread over classes in packages)
  18. Hierarchy evolution view
  19. Ownership map
  21. Mondrian: An Agile Visualization Framework, SoftVis 2006
  23. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Languages: Smalltalk, Java, COBOL, C++, … Extensible meta model. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk.
  24. Tools: ConAn, Van, Hapax, ..., CodeCrawler. Languages: Smalltalk, Java, COBOL, C++, … Extensible meta model. Model repository: Navigation, Metrics, Querying, Grouping. Smalltalk. But, we have a huge bottleneck for new languages ...
  32. Grammar Stealing
  34. Figure 2. Coverage diagram for grammar stealing.
      [Flowchart labels: Start; Compiler sources?; Hard-coded parser; BNF; Quality?; Recover the grammar; Language reference manual?; General rules; Constructions by example. Annotations: One case: perl; One case: RPG; No cases known.]
      Excerpt (left column): ... Figure 2 shows the first two cases; the third is just a combination. If you start with a hard-coded grammar, you must reverse-engineer it from the handwritten code. Fortunately, the comments of such code often include BNF rules (Backus Naur Forms) ...
      Excerpt (right column): ... through general rules, or by both approaches. If a manual uses general rules, its quality is generally not good: reference manuals and language standards are full of errors. It is our experience that the myriad errors are repairable. As an aside, we once ...
  35. Figure 2. Coverage diagram for grammar stealing (same figure and excerpt as slide 34). Still takes a couple of weeks and lots of expertise.
  36. Recycling Trees. Daniel Langone. Recycling Trees: Mapping Eclipse ASTs to Moose Models. Bachelor's thesis, University of Bern.
      [Background: left-truncated excerpt from the thesis about extracting a model of a Ruby software system from an instance of the Ruby plugin AST with the Fame framework and exporting it to an MSE file. Figure caption fragment: "... The dotted lines correspond to the extraction of a (meta-)model."]
  37. 1. Infer AST implementation from IDE plugin. 2. Extract metamodel from plugin. 3. Map model elements to FAMIX (Moose).
  38. Cool idea, but hard to make it work in practice
  39. Parsing by Example. Example-Driven Reconstruction of Software Models, CSMR 2007
  40. [Diagram: the parsing-by-example cycle. Numbered steps 1 to 6, labelled specify examples, import, infer, generate, parse and export, connect Source code, Example mappings, Grammar, Parser and Model.]
  41. CodeSnooper
  44. Ruby case study and JBoss case study. Markus Kobel. Parsing by Example. MSc, U Bern, April 2005.

      Problems:
        • Ambiguity
        • False positives
        • False negatives
        • Embedded languages

      JBoss case study (excerpt):
      In a third iteration we add examples to recognize attributes. Once again we obtain three parsers based on three sets of examples for abstract classes, concrete classes and interfaces. We obtain the following results:

                                     Precise Model   Our Model
      Number of Model Classes                  366         346
      Number of Abstract Classes               233         230
      Total Number of Methods                 1887        1780
      Total Number of Attributes               395         304

      This process can be repeated to cover more and more of the subject language. The question on when to stop can be answered with "When the results are good enough". Good enough in this context means when we have enough information for a specific reverse engineering task. For example, a "System Complexity View" [18] is a visualization used to obtain an initial impression of a legacy software system. To generate such a view we need to parse a significant number of the classes, identify subclass relations, and establish the numbers of methods and attributes of each class. Even if we parse only 80% of the code, we can still get an initial impression of the state of the system. If on the other we would want to display a "Class Blueprint" [17], a semantically enriched visualization of the internal structure of classes we would need a refined grammar to extract more information. The "good enough" is thus given by the reverse engineering goals, which vary from case to case.

      Ruby case study (excerpt):
      ... retrieve the reference model by inspecting the source code manually.
      In Ruby there are Classes and Modules. Modules are collections of Methods and Constants. They cannot generate instances. However they can be mixed into Classes and other Modules. A Module cannot inherit from anything. Modules also have the function of Namespaces. Ruby does not support Abstract Classes [22].
      For the definition of the scanner tokens for identifiers and comments we use the following regular expressions:

      <IDENTIFIER>: [a-zA-Z$] \w* (? | !)? ;
      <comment>: # [^\r\n]* <eol> ;

      Using just 2 examples each of namespaces, classes, methods and attributes, we are able to parse 7 of the 22 files.

                                     Precise Model   7 files   Our Model
      Number of Namespaces                       8         6           6
      Number of Model Classes                   25         4           4
      Total Number of Methods                  247        26          26
      Total Number of Attributes               136         9           9

      Amongst the files we could not parse, there are 4 large files containing GUI code. If we ignore these files, we are able to detect about 25% of the target elements.
      There are two main reasons that so few files can be successfully parsed:
      1. The comment character # occurs frequently in strings and regular expressions, causing our simple-minded scanner to fail. A better scanner would fix this problem. With some simple preprocessing (removing any hash character that occurs inside a string and removing all comments) we can improve recall to 65-85%.
      2. Ruby offers a very rich syntax for control constructs, allowing the same keywords to occur in many different positions and contexts. One would need many more examples to recognize these constructs.
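The scanner problem described on slide 44 can be illustrated with a small Python sketch. This is not the scanner from the thesis: the function names and the string-aware pattern are assumptions made for illustration, and the preprocessing shown here keeps string literals intact while dropping comments, which is a variant of the preprocessing the slide mentions. Ruby regular-expression literals are deliberately not handled.

```python
import re

# Naive comment token, modelled on the slide's "<comment>: # ... <eol>" rule.
NAIVE_COMMENT = re.compile(r"#[^\r\n]*")

# Hypothetical string-aware pattern: try a quoted string first, then a comment.
STRING_OR_COMMENT = re.compile(
    r"""("(?:\\.|[^"\\\r\n])*"|'(?:\\.|[^'\\\r\n])*')|(#[^\r\n]*)"""
)


def strip_comments_naive(source: str) -> str:
    """Remove everything from the first '#' to the end of each line."""
    return NAIVE_COMMENT.sub("", source)


def strip_comments_string_aware(source: str) -> str:
    """Remove comments but leave '#' characters inside string literals alone."""
    def keep_strings(match):
        # Group 1 is a string literal (kept); group 2 is a comment (dropped).
        return match.group(1) if match.group(1) is not None else ""
    return STRING_OR_COMMENT.sub(keep_strings, source)


line = 'puts "total: #{count}"  # print the total'
print(repr(strip_comments_naive(line)))         # the string literal is cut in half
print(repr(strip_comments_string_aware(line)))  # only the trailing comment is gone
```

The naive rule treats the `#` of the interpolation `#{count}` as the start of a comment, which is the kind of failure the slide attributes to the simple scanner.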
  45. Evolutionary Grammar Generation. Sandro De Zanet. Grammar Generation with Genetic Programming: Evolutionary Grammar Generation. MSc, U Bern, July 2009.
      [Background figure from the thesis: Figure 4.5: Insert a node (nodes A, B, C).]
  46. Excerpt from the thesis:
      ... not introduce new information. Be aware that every modification of an individual has to result in a new individual that is valid. Validity is very dependent on the search space - it generally means that fitness function as well as the genetic operators should be applicable to a valid individual. A schematic view is shown in fig. 3.1.
      [Figure 3.1: Principles of an Evolutionary Algorithm. Boxes: generate new random population; fit enough?; select most fit individuals; generate new population with genetic operators (mutation, crossover).]
      There are alternatives to rejecting a certain number of badly performing individuals per generation. To compute the new generation, one can generate new individuals from all individuals of the old generation. This would not result in an improvement since the selection is completely random. Hence the parent individuals are selected ...
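The loop in Figure 3.1 can be sketched in a few lines of Python. This is a generic toy, not the thesis implementation: the `evolve` function, its parameters and the string-matching problem are all assumptions chosen so that every individual stays valid (a fixed-length lowercase string) under mutation and crossover.

```python
import random
import string


def evolve(fitness, random_individual, mutate, crossover,
           population_size=30, survivors=10, max_generations=500, seed=0):
    """Generic evolutionary loop; lower fitness is better, 0 means fit enough."""
    rng = random.Random(seed)
    # "generate new random population"
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(max_generations):
        population.sort(key=fitness)
        if fitness(population[0]) == 0:      # "fit enough?"
            break
        parents = population[:survivors]     # "select most fit individuals"
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)    # "genetic operators"
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=fitness)


TARGET = "cat"  # toy goal, purely illustrative


def fitness(candidate):
    """Number of positions that differ from the target."""
    return sum(1 for got, want in zip(candidate, TARGET) if got != want)


def random_individual(rng):
    return "".join(rng.choice(string.ascii_lowercase) for _ in TARGET)


def mutate(candidate, rng):
    """Replace one position with a random letter; the length never changes."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(string.ascii_lowercase) + candidate[i + 1:]


def crossover(a, b, rng):
    """Single-point crossover of two equally long strings."""
    cut = rng.randint(0, len(a))
    return a[:cut] + b[cut:]


best = evolve(fitness, random_individual, mutate, crossover)
print(best, fitness(best))
```

Keeping the best parents in each generation is one way to avoid the purely random selection the excerpt warns about; other selection schemes would fit the same loop.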
  47. PEG mutation and crossover. Overlapping excerpts from the thesis (chapter 4, "Combination of PEGs and Genetic Programming", section 4.1, "Tuning PEGs for Genetic Programming"); text truncated in extraction is marked with "...".
      [Figures: Figure 4.1: Add a node; Figure 4.2: Add back link node; Figure 4.3: Delete a node; Figure 4.4: Push a node; Figure 4.5: Insert a node (nodes A, B, C).]

      ... the more common mutations that affect the structure of the grammar ... dependent on the node. They will only affect not primitive and not ...
      ... A randomly generated parser will be inserted in the list of children (fig. 4.1).
      ... similarly to adding a child. Although in this case the new parser is not generated but selected from one of the nodes already existing in the CFG ... a link to this parser (like a ... rule, fig. 4.2).
      ... A randomly selected child will be removed. No effect, if there is only ... that we don't allow composite parsers with no children, since ... constitute a valid grammar (4.3).
      ... no effect on unary or primitive parsers like the character parser.

      To ensure the evolvability of more complex parsers we need more complex mutations. After the initial population got sorted mostly only single character parsers were left and couldn't mutate to parsers with more nodes. The following mutations add the possibility to insert nodes between the current parser and the root parser:
      deletion: The selected parser first moves all its children to the parent parser, thus replacing itself by its children (fig. 4.4).
      insertion: The selected parser is replaced by a composite parser (sequence or choice). The selected parser is then added to the new parser. This results in the insertion of a new parser in between the selected parser and its parent parser. If the selected parser is the root, the new parser becomes the new root (fig. 4.5).
      Insertion ensures diversity in graph depth. Graph depth is important for the emergence of more complex structures. The deletion, on the other hand, puts a counterweigh ...

      4.1.3 Fitness function
      The fitness function is the most important part of the evo... envisioned goal. It determines the q... directly defining the ... representation of the distance to the solution. Without an ela... search will head in the wrong direction. There is also the ri... solutions to be trapped in local extrema.
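The insertion and deletion mutations described on slide 47 can be sketched on a plain tree in Python. The `Parser` class and both functions are hypothetical names, not the thesis code, and the sketch assumes a tree without back links. Treating deletion as a no-op on the root and on leaf parsers is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Parser:
    kind: str                      # "seq", "choice" or "char" (illustrative kinds)
    label: str = ""
    children: list = field(default_factory=list)


def find_parent(root, target):
    """Return the parent of `target` below `root`, or None if there is none."""
    for child in root.children:
        if child is target:
            return root
        found = find_parent(child, target)
        if found is not None:
            return found
    return None


def child_index(parent, target):
    return next(i for i, c in enumerate(parent.children) if c is target)


def insertion(root, selected, kind="seq"):
    """Wrap `selected` in a new composite parser; return the (possibly new) root."""
    wrapper = Parser(kind, children=[selected])
    if selected is root:
        return wrapper
    parent = find_parent(root, selected)
    parent.children[child_index(parent, selected)] = wrapper
    return root


def deletion(root, selected):
    """Replace `selected` by its own children inside its parent."""
    parent = None if selected is root else find_parent(root, selected)
    if parent is None or not selected.children:
        return root                # assumption: no effect on the root or on leaves
    i = child_index(parent, selected)
    parent.children[i:i + 1] = selected.children
    return root


a, b = Parser("char", "a"), Parser("char", "b")
grammar = Parser("seq", children=[a, b])
grammar = insertion(grammar, b, "choice")         # seq(a, choice(b))
new_root = insertion(grammar, grammar, "choice")  # choice(seq(a, choice(b)))
flattened = deletion(new_root, grammar)           # choice(a, choice(b))
```

Insertion adds one level of depth and deletion removes one, which matches the slide's description of the two mutations acting as counterweights.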
  48. Slow and expensive. Modest results for complex languages.

      Desired grammar: ([a-z] (‘_‘ | [0-9] | [a-z])*)
      Found grammar:   (([a-z] ({‘n‘ | ‘_‘ | [0-9]})*))*

      Desired grammar: 0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ ([a-z])+ 1 -> {2 -> (‘n‘ 0) | e})
      Found grammar:   (0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ 2 -> (([a-z])+ ‘n‘)))+

      Desired grammar: 0 -> (‘+‘ | ‘-‘ | ‘<‘ | ‘>‘ | ‘,‘ | ‘.‘ | 1 -> (‘[‘ 2 -> (0)* ‘]‘))
      Found grammar:   (0 -> {‘<‘ | ‘]‘ | ‘.‘ | ‘,‘ | ‘>‘ | ‘-‘ | ‘[‘ | ‘+‘})*
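To see how the first found grammar on slide 48 differs from the desired one, both can be transcribed into Python regular expressions. The transcription is an assumption: braces are read as a choice, and ‘n‘ is taken literally as the letter n, although it may have been an escaped newline on the original slide.

```python
import re

# Desired: ([a-z] ('_' | [0-9] | [a-z])*)
DESIRED = re.compile(r"[a-z](?:_|[0-9]|[a-z])*")

# Found: (([a-z] ({'n' | '_' | [0-9]})*))*
FOUND = re.compile(r"(?:[a-z](?:n|_|[0-9])*)*")

samples = ["foo_1", "x", "", "1abc"]
results = {
    text: (bool(DESIRED.fullmatch(text)), bool(FOUND.fullmatch(text)))
    for text in samples
}
for text, (desired_ok, found_ok) in results.items():
    print(repr(text), "desired:", desired_ok, "found:", found_ok)
```

Under this reading the found grammar accepts every identifier the desired grammar accepts, but it also accepts the empty string, because its outer repetition may match zero times.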
  49. What now?
  50. Exploit indentation as a proxy for structure
  51. Exploit indentation as a proxy for structure
  52. Exploit indentation as a proxy for structure. Exploit similarities between languages (adapt and compose)
  53. Exploit indentation as a proxy for structure. Exploit similarities between languages (adapt and compose)
  54. Exploit indentation as a proxy for structure. Exploit similarities between languages (adapt and compose). Incrementally refine island grammars
  55. Exploit indentation as a proxy for structure. Exploit similarities between languages (adapt and compose). Incrementally refine island grammars. Ideas? ...
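The first idea on the closing slides, using indentation as a proxy for structure, can be sketched in Python without any grammar at all. The function name, the dictionary layout and the Ruby-like sample are assumptions for illustration; the slides do not describe a concrete algorithm.

```python
def indentation_tree(source: str, tab_width: int = 4):
    """Nest each non-blank line under the nearest preceding line that is less indented."""
    root = {"text": "<root>", "indent": -1, "children": []}
    stack = [root]
    for raw in source.splitlines():
        if not raw.strip():
            continue                              # blank lines carry no structure
        line = raw.expandtabs(tab_width)
        indent = len(line) - len(line.lstrip(" "))
        node = {"text": line.strip(), "indent": indent, "children": []}
        while stack[-1]["indent"] >= indent:      # climb back up to the enclosing block
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


sample = """\
class Account
  def deposit(amount)
    @balance += amount
  end
  def balance
    @balance
  end
end
"""

tree = indentation_tree(sample)
top_level = [node["text"] for node in tree["children"]]
class_body = [node["text"] for node in tree["children"][0]["children"]]
print(top_level)
print(class_body)
```

The resulting tree groups the two method definitions under the class without knowing any Ruby keywords, although closing `end` lines show up as siblings and would need language-specific handling.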
