Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Reducing Redundancies in Multi-Revision Code Analysis

262 views

Published on

Software engineering research often requires analyzing
multiple revisions of several software projects, be it to make and
test predictions or to observe and identify patterns in how software evolves. However, code analysis tools are almost exclusively designed for the analysis of one specific version of the code, and the time and resources requirements grow linearly with each additional revision to be analyzed. Thus, code studies often observe a relatively small number of revisions and projects. Furthermore, each programming ecosystem provides dedicated tools, hence researchers typically only analyze code of one language, even when researching topics that should generalize
to other ecosystems. To alleviate these issues, frameworks and models have been developed to combine analysis tools or automate the analysis of multiple revisions, but little research has gone into actually removing redundancies in multi-revision, multi-language code analysis. We present a novel end-to-end approach that systematically avoids redundancies every step of the way: when reading sources from version control, during parsing, in the internal code representation, and during the actual analysis. We evaluate our open-source implementation, LISA, on the full
history of 300 projects, written in 3 different programming languages, computing basic code metrics for over 1.1 million program revisions. When analyzing many revisions, LISA requires less than a second on average to compute basic code metrics for all files in a single revision, even for projects consisting of millions of lines of code.

  • Be the first to comment

  • Be the first to like this

Reducing Redundancies in Multi-Revision Code Analysis

  1. 1. Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall Software Evolution and Architecture Lab University of Zurich, Switzerland {alexandru,panichella,gall}@ifi.uzh.ch 22.02.2017 SANER’17 Klagenfurt, Austria
  2. 2. The Problem Domain • Static analysis (e.g. #Attr., McCabe, coupling...) 1
  3. 3. The Problem Domain v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0 2 • Static analysis (e.g. #Attr., McCabe, coupling...)
  4. 4. The Problem Domain v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0 2 • Static analysis (e.g. #Attr., McCabe, coupling...) • Many revisions, fine-grained historical data
  5. 5. A Typical Analysis Process 3 www clone select project
  6. 6. A Typical Analysis Process www clone checkout select project select revision 3
  7. 7. A Typical Analysis Process www clone checkout analysis tool Res select project select revision apply tool store analysis results 3
  8. 8. A Typical Analysis Process www clone checkout analysis tool Res select project select revision apply tool store analysis results more revisions? 3
  9. 9. A Typical Analysis Process www clone checkout analysis tool Res select project select revision apply tool store analysis results more revisions? more projects? 3
  10. 10. Redundancies all over... 4 Redundancies in historical code analysis Impact on Code Study Tools
  11. 11. Redundancies all over... Redundancies in historical code analysis Few files change Only small parts of a file change Across Revisions Impact on Code Study Tools 4
  12. 12. Redundancies all over... Redundancies in historical code analysis Few files change Only small parts of a file change Across Revisions Impact on Code Study Tools Repeated analysis of "known" code 4
  13. 13. Redundancies all over... Redundancies in historical code analysis Few files change Only small parts of a file change Changes may not even affect results Across Revisions Impact on Code Study Tools Repeated analysis of "known" code Storing redundant results 4
  14. 14. Redundancies all over... 4 Redundancies in historical code analysis Few files change Across Languages Only small parts of a file change Changes may not even affect results Each language has their own toolchain Yet they share many metrics Across Revisions Impact on Code Study Tools Repeated analysis of "known" code Storing redundant results
  15. 15. Redundancies all over... Redundancies in historical code analysis Few files change Across Languages Only small parts of a file change Changes may not even affect results Each language has their own toolchain Yet they share many metrics Across Revisions Impact on Code Study Tools Repeated analysis of "known" code Storing redundant results Re-implementing identical analyses Generalizability is expensive 4
  16. 16. Redundancies all over... Redundancies in historical code analysis Few files change Across Languages Only small parts of a file change Changes may not even affect results Each language has their own toolchain Yet they share many metrics Across Revisions Impact on Code Study Tools Repeated analysis of "known" code Storing redundant results Re-implementing identical analyses Generalizability is expensive 5
  17. 17. Important! 6 Techniques implemented in LISA Your favourite analysis features Pick what you like!
  18. 18. #1: Avoid Checkouts
  19. 19. Avoid checkouts 7 clone
  20. 20. Avoid checkouts 7 clone checkout read write
  21. 21. Avoid checkouts 7 clone checkout read read write analyze
  22. 22. Avoid checkouts 7 clone checkout read For every file: 2 read ops + 1 write op Checkout includes irrelevant files Need 1 CWD for every revision to be analyzed in parallel read write analyze
  23. 23. Avoid checkouts 8 clone analyze read
  24. 24. Avoid checkouts 8 clone analyze Only read relevant files in a single read op No write ops No overhead for parallization read
  25. 25. Avoid checkouts 8 clone analyze Only read relevant files in a single read op No write ops No overhead for parallization Git Analysis Tool File Abstraction Layer read
  26. 26. Avoid checkouts 8 clone analyze Only read relevant files in a single read op No write ops No overhead for parallization Git Analysis Tool File Abstraction Layer E.g. for the JDK Compiler: class JavaSourceFromCharrArray(name: String, val code: CharBuffer) extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) { override def getCharContent(): CharSequence = code } read
  27. 27. Avoid checkouts clone analyze Only read relevant files in a single read op No write ops No overhead for parallization Git Analysis Tool File Abstraction Layer E.g. for the JDK Compiler: class JavaSourceFromCharrArray(name: String, val code: CharBuffer) extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) { override def getCharContent(): CharSequence = code } read 9
  28. 28. #2: Use a multi-revision representation of your sources
  29. 29. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 10
  30. 30. rev. 1 rev. 2 rev. 3 rev. 4 11 rev. 1 Merge ASTs
  31. 31. rev. 1 rev. 2 rev. 3 rev. 4 12 rev. 2 Merge ASTs
  32. 32. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 13 rev. 3
  33. 33. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 14 rev. 4
  34. 34. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 15
  35. 35. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 16 rev. range [1-4] rev. range [1-2]
  36. 36. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 AspectJ (~440k LOC): 1 commit: 2.2M nodes All >7000 commits: 6.5M nodes 17
  37. 37. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 AspectJ (~440k LOC): 1 commit: 2.2M nodes All >7000 commits: 6.5M nodes 18
  38. 38. Merge ASTs rev. 1 rev. 2 rev. 3 rev. 4 AspectJ (~440k LOC): 1 commit: 2.2M nodes All >7000 commits: 6.5M nodes 19
  39. 39. #3: Store AST nodes only if they're needed for analysis
  40. 40. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } What's the complexity (1+#forks) and name for each method and class? 20
  41. 41. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } 140 AST nodes (using ANTLR) parse What's the complexity (1+#forks) and name for each method and class? 20
  42. 42. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } 140 AST nodes (using ANTLR) parse CompilationUnit TypeDeclaration Modifiers public Members Method Modifiers public Name run Name Demo Parameters ReturnType PrimitiveType VOID Body Statements ... ... What's the complexity (1+#forks) and name for each method and class? 20
  43. 43. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } What's the complexity (1+#forks) and name for each method and class? 140 AST nodes (using ANTLR) parse filtered parse TypeDeclaration Method Name run Name DemoForStatement IfStatement ConditionalExpression 7 AST nodes (using ANTLR) 21
  44. 44. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } What's the complexity (1+#forks) and name for eachmethod and class? 140 AST nodes (using ANTLR) parse filtered parse TypeDeclaration Method Name run Name DemoForStatement IfStatement ConditionalExpression 7 AST nodes (using ANTLR) 22
  45. 45. public class Demo { public void run() { for (int i = 1; i< 100; i++) { if (i % 3 == 0 || i % 5 == 0) { System.out.println(i) } } } What's the complexity (1+#forks) and name for eachmethod and class? 140 AST nodes (using ANTLR) parse filtered parse TypeDeclaration Method Name run Name DemoForStatement IfStatement ConditionalExpression 7 AST nodes (using ANTLR) 23
  46. 46. #4: Use non-duplicative data structures to store your results
  47. 47. rev. 1 rev. 2 rev. 3 rev. 4 24
  48. 48. rev. 1 rev. 2 rev. 3 rev. 4 24
  49. 49. rev. 1 rev. 2 rev. 3 rev. 4 24 [1-1] label #attr mcc [4-4] label #attr mcc InnerClass 0 4 1 2 4 [2-3] label #attr mcc
  50. 50. rev. 1 rev. 2 rev. 3 rev. 4 [1-1] label #attr mcc [4-4] label #attr mcc InnerClass 0 4 1 2 4 [2-3] label #attr mcc 25
  51. 51. LISA also does: #5: Parallel Parsing #6: Asynchronous graph computation #7: Generic graph computations applying to ASTs from compatible languages 26
  52. 52. To Summarize...
  53. 53. The LISA Analysis Process 27 www clone select project
  54. 54. The LISA Analysis Process 27 www clone parallel parse into merged graph select project Language Mappings (Grammar to Analysis) determines which AST nodes are loaded ANTLRv4 Grammar used by Generates Parser
  55. 55. The LISA Analysis Process 27 www clone parallel parse into merged graph Res select project store analysis results Analysis formulated as Graph Computation Language Mappings (Grammar to Analysis) used by Async. compute determines which AST nodes are loaded determines which data is persisted ANTLRv4 Grammar used by Generates Parser runs on graph
  56. 56. The LISA Analysis Process 27 www clone parallel parse into merged graph Res select project store analysis results more projects? Analysis formulated as Graph Computation Language Mappings (Grammar to Analysis) used by Async. compute determines which AST nodes are loaded determines which data is persisted ANTLRv4 Grammar used by Generates Parser runs on graph
  57. 57. How well does it work, then?
  58. 58. Marginal cost for +1 revision 41.670 4.633 0.525 0.109 0.082 0.071 0.052 0.041 0.032 0.033 0 5 10 15 20 25 30 35 40 45 50 1 10 100 1000 2000 3000 4000 5000 6000 7000 Average Parsing+Computation time per Revision when analyzing n revisions of AspectJ (10 common metrics) Seconds # of revisions 28
  59. 59. Overall Performance Stats 29 Language Java C# JavScript #Projects 100 100 100 #Revisions 646'261 489'764 204'301 #Files (parsed!) 3'235'852 3'234'178 507'612 #Lines (parsed!) 1'370'998'072 961'974'773 194'758'719 Total Runtime (RT)¹ 18:43h 52:12h 29:09h Median RT¹ 2:15min 4:54min 3:43min Tot. Avg. RT per Rev.² 84ms 401ms 531ms Med. Avg. RT per Rev.² 30ms 116ms 166ms ¹ Including cloning and persisting results ² Excluding cloning and persisting results
  60. 60. What's the catch? (There are a few...)
  61. 61. The (not so) minor stuff • Must implement analyses from scratch • No help from a compiler • Non-file-local analyses need some effort 30
  62. 62. The (not so) minor stuff • Must implement analyses from scratch • No help from a compiler • Non-file-local analyses need some effort • Moved files/methods etc. add overhead • Uniquely identifying files/entities is hard • (No impact on results, though) 30
  63. 63. Language matters 31 Javascript C# Java E.g.: Javascript takes longer because: • Larger files, less modularization • Slower parser (automatic semicolon-insertion)
  64. 64. LISA is EXTREME 32 complex feature-rich heavyweight simple generic lightweight
  65. 65. Thank you for your attention SANER ‘17, Klagenfurt, 22.02.2017 Read the paper: http://t.uzh.ch/Fj Try the tool: http://t.uzh.ch/Fk Get the slides: http://t.uzh.ch/Fm Contact me: alexandru@ifi.uzh.ch

×