
Judge: Identifying, Understanding, and Evaluating Sources of Unsoundness in Call Graphs


Call graphs are widely used, in particular for advanced control- and data-flow analyses. Even though many call graph algorithms with different precision and scalability properties have been proposed, a comprehensive understanding of the sources of unsoundness, their relevance, and the capabilities of existing call graph algorithms in this respect is missing. To address this problem, we propose Judge, a toolchain that helps with understanding sources of unsoundness and improving the soundness of call graphs. In several experiments, we use Judge and an extensive test suite related to sources of unsoundness to (a) compute capability profiles for the call graph implementations of Soot, WALA, DOOP, and OPAL, (b) determine the prevalence of language features and APIs that affect soundness in modern Java bytecode, (c) compare the call graphs of Soot, WALA, DOOP, and OPAL – highlighting important differences in their implementations, and (d) evaluate the effort necessary to achieve project-specific, reasonably sound call graphs. We show that soundness-relevant features/APIs are frequently used and that support for them differs vastly, up to the point where comparing call graphs computed by the same base algorithms (e.g., RTA) but different frameworks is bogus. We also show that Judge can support users in establishing the soundness of call graphs with reasonable effort.


  1. Judge: Identifying, Understanding, and Evaluating Sources of Unsoundness in Call Graphs. Michael Reif, Florian Kübler, Michael Eichberg, Dominik Helm, and Mira Mezini. Software Technology Group, TU Darmstadt. @Reifmi
  2. Why We Shouldn’t Take Call Graphs for Granted • Call graphs are a central data structure for numerous static analyses • Call graphs directly impact a client analysis’ result • The chosen algorithm predetermines an analysis’ precision and recall • Programming languages evolve (APIs and features are added), and frameworks might not keep pace
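To make the last two bullets concrete, here is a minimal, self-contained Java example (all class and method names are invented) of how the chosen call-graph algorithm predetermines what a client analysis sees at a single virtual call site:

```java
// Hypothetical example illustrating why the chosen call-graph algorithm
// predetermines a client analysis' precision and recall.
interface Handler { void handle(); }

class LoggingHandler implements Handler {
    public void handle() { System.out.println("log"); }
}

class NativeHandler implements Handler {
    public void handle() { System.loadLibrary("native"); } // security-relevant behavior
}

public class Client {
    public static void main(String[] args) {
        Handler h = new LoggingHandler();
        h.handle();
        // CHA resolves h.handle() to BOTH LoggingHandler.handle and
        // NativeHandler.handle (every subtype of the declared type Handler);
        // a points-to-based call graph resolves it to LoggingHandler.handle only.
        // A client analysis that flags reachable calls to System.loadLibrary
        // therefore reports a finding under CHA but not under the more precise graph.
    }
}
```

Conversely, a call graph that misses edges, e.g., because a language feature or API is unsupported, hides such behavior from every client analysis built on top of it.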
  3. State-of-the-art Call-graph Generators for Java • Many different static analysis frameworks are available (e.g., Soot, WALA, DOOP, OPAL) • Each can compute a different set of call graphs • All frameworks use different approaches and make unknown trade-offs or implementation choices • Are they actually comparable?
  4.–7. Judge’s Overview (pipeline diagram, built up over four slides) • Test fixtures (⟨Test Fixtures Category⟩.md with Test Case 1 (TC1) … Test Case N (TCN)) are compiled into per-test-case JARs (TC1.jar, TC2.jar, …) • For each supported static analysis framework, a call graph (⟨CG⟩.json) is computed per test case and checked against the expected call targets, yielding a ⟨CG Algorithm Profile⟩.tsv • For a real project (⟨Project⟩.jar), Hermes is run to extract the used features and their locations (⟨Features & Locations⟩.json), and a call graph is computed for the project • Combining the project’s features and call graph with the respective CG profile yields the ⟨Potential Sources of Unsoundness⟩.tsv, i.e., the suitability of the CG algorithm for that project
  8.–9. Test Suite (shown within the overview diagram) • Each category has a description and multiple test cases • Each test case has a scenario description, a unique id, the test code, and the expected calls • Available annotations: CallSite, IndirectCall
  10. Test Suite • Language Features: Static Initializer, Polymorphic Calls, Java 8 Polymorphic Calls, Lambdas/Method References, Signature Polymorphic Methods, Non-Java Bytecode, … • APIs: Reflection, Unsafe, Serialization, Method Handles, Dynamic Proxies, Classloading, …
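The slides list the annotations by name only; the exact annotation signatures of Judge’s test suite are not shown, so the following is a hedged sketch with a made-up @IndirectCall annotation that illustrates how a reflection test case could encode its expected call target:

```java
// Sketch only: the @IndirectCall annotation defined here is hypothetical;
// Judge's actual annotation names and attributes may differ.
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

@Retention(RetentionPolicy.CLASS)
@interface IndirectCall {
    String name();            // expected target method name
    String declaringClass();  // expected declaring class of the target
    int line();               // line of the call site that triggers it
}

class ReflectionTestCase {
    static void target() { /* must be reached by a sound call graph */ }

    @IndirectCall(name = "target", declaringClass = "ReflectionTestCase", line = 20)
    static void entry() throws Exception {
        // A sound call graph must contain an edge from entry() to target(),
        // even though the call happens via the Reflection API.
        ReflectionTestCase.class.getDeclaredMethod("target").invoke(null);
    }
}
```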
  11. Computing the Algorithms’ Profile (overview diagram: for each framework and each test case, the computed ⟨CG⟩.json is compared with the expected call targets, producing the ⟨CG Algorithm Profile⟩.tsv)
  12.–13. Finding Features in Real Code • We used Hermes [1], a static analysis code query infrastructure • Each query is an analysis that checks whether a specific feature occurs in a given code base • We developed 15 Hermes queries that derive 107 Hermes features and map the derived features to the test case ids • All queries perform a most-conservative, intra-procedural analysis
  [1] Reif, Michael et al. Hermes: Assessment and Creation of Effective Test Corpora. SOAP ’17. ACM, 43–48.
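Hermes’ query API itself is not shown on the slides; purely to illustrate the idea of a most-conservative, intraprocedural feature query, here is a small sketch written against the ASM bytecode library instead (not Hermes’ API). It flags uses of the core Reflection API and of invokedynamic in a single class file:

```java
// Sketch only: a most-conservative, intraprocedural "feature query" in the
// spirit of a Hermes query, written with ASM rather than Hermes.
import org.objectweb.asm.*;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FeatureQuery {
    public static void main(String[] args) throws IOException {
        byte[] classFile = Files.readAllBytes(Path.of(args[0]));
        new ClassReader(classFile).accept(new ClassVisitor(Opcodes.ASM9) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String sig, String[] exceptions) {
                return new MethodVisitor(Opcodes.ASM9) {
                    @Override
                    public void visitMethodInsn(int opcode, String owner, String mName,
                                                String mDesc, boolean isInterface) {
                        // Conservative: any call into java.lang.reflect or Class.forName
                        // counts as a use of the Reflection "feature".
                        if (owner.startsWith("java/lang/reflect/")
                                || (owner.equals("java/lang/Class") && mName.equals("forName"))) {
                            System.out.println("Reflection used in " + name + ": " + owner + "." + mName);
                        }
                    }

                    @Override
                    public void visitInvokeDynamicInsn(String idName, String idDesc,
                                                       Handle bsm, Object... bsmArgs) {
                        // invokedynamic covers, e.g., lambdas and method references.
                        System.out.println("invokedynamic in " + name + " (bootstrap: " + bsm.getOwner() + ")");
                    }
                };
            }
        }, ClassReader.SKIP_FRAMES);
    }
}
```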
  14. Potential Sources of Unsoundness (mapping sketch: the features found per method by the Hermes queries, e.g., BPC2 Polymorphic Call, TR1 Reflection, Lambda3 Invokedynamic for Java ≤ 10, Lambda8 Invokedynamic for Scala, are crossed, for a project p, with the methods reached by the call graph CG(a) in application and library code and with CG(a)’s capability profile) • Sources of unsoundness definitely make the call graph unsound • Conditional sources of unsoundness might introduce unsoundness
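As a rough illustration of that mapping (the data model and names below are invented, not Judge’s actual format): a feature occurrence becomes a potential source of unsoundness when it sits in a method the call graph reaches but the algorithm’s capability profile does not support the feature.

```java
// Minimal sketch of deriving per-project potential sources of unsoundness
// from feature occurrences, reached methods, and a capability profile.
import java.util.*;

public class UnsoundnessReport {
    record FeatureOccurrence(String feature, String method) {}

    static List<FeatureOccurrence> potentialSourcesOfUnsoundness(
            List<FeatureOccurrence> occurrences,   // from the feature queries
            Set<String> reachedMethods,            // from the computed call graph
            Map<String, Boolean> profile) {        // feature -> supported by the CG algorithm
        List<FeatureOccurrence> result = new ArrayList<>();
        for (FeatureOccurrence o : occurrences) {
            boolean supported = profile.getOrDefault(o.feature(), false);
            if (!supported && reachedMethods.contains(o.method())) {
                result.add(o); // unsupported feature in reachable code => potential unsoundness
            }
        }
        return result;
    }
}
```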
  15. Research Questions • RQ1: How prevalent are the language and API features? • RQ2: How do the frameworks compare to each other? • RQ3: Which framework is best suited for which kind of code base? • RQ4: How much effort is necessary to get a sound call graph?
  16. Prevalent Language Features and APIs (RQ1) • All the API and language features supported by Java up to version 7 are used widely across all code bases • Support for Java 8 is a must, unless analyzing Android or Clojure code • Supporting classical Reflection and Serialization is strongly recommended, independent of the source code’s age • Support for many features is only required in specific scenarios
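As one concrete reason why Serialization support matters for soundness (a self-contained illustration; the class is invented): readObject() has no visible call site in application code, it is invoked by ObjectInputStream during deserialization, so a call graph that does not model the Serialization API has no edge into it.

```java
// readObject() is only reached through the serialization machinery; a call
// graph without Serialization support misses this edge entirely.
import java.io.*;

class Config implements Serializable {
    private static final long serialVersionUID = 1L;
    String value = "default";

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = value.trim(); // reachable code, but only via deserialization
    }
}

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Config());
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Config c = (Config) in.readObject(); // triggers Config.readObject indirectly
            System.out.println(c.value);
        }
    }
}
```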
  17.–27. The Call Graphs’ Feature Support (RQ2) (capability matrix, built up incrementally) • Standard Java features are well supported • Java 8 features are partially supported • The JVM is not fully covered • The Reflection API is partially supported • Some APIs and language features are unsupported
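One reason partial Java 8 support is soundness-relevant: lambdas and method references compile to invokedynamic instructions bootstrapped by LambdaMetafactory, so a call graph that does not model invokedynamic loses the edge into the lambda body. A minimal illustration:

```java
// Without invokedynamic support, there is no edge from r.run() to helper(),
// so helper() appears unreachable.
public class LambdaDemo {
    static void helper() {
        System.out.println("reached only through the method reference");
    }

    public static void main(String[] args) {
        Runnable r = LambdaDemo::helper; // method reference -> invokedynamic
        r.run();                         // must be resolved to helper()
    }
}
```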
  28.–32. Performance Results (RQ2) • Average runtimes differ largely between the frameworks • The number of reachable methods varies by more than 20x, even for implementations of the same algorithm
  33.–38. RTA Example • RTA [2] depends on the program’s instantiated types • Soot, WALA, and OPAL behave completely differently
     void program(boolean condition) {
         Collection c1 = new LinkedList();
         Collection c2;
         if (condition) {
             c2 = new ArrayList();
         } else {
             c2 = new Vector();
         }
         c2.add(null);
         Collection c3 = new HashSet();
     }
  Depending on the implementation, the instantiated-type set used to resolve c2.add(null) is { LinkedList, ArrayList, Vector, HashSet }, { LinkedList, ArrayList, Vector }, or { ArrayList, Vector }.
  [2] D. Bacon and P. Sweeney. Fast Static Analysis of C++ Virtual Function Calls. OOPSLA ’96. ACM, 324–341.
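To spell out why those three sets produce different call graphs, here is a minimal sketch (all data and names invented) of RTA-style target resolution: the targets of a virtual call are the implementations in subtypes of the declared receiver type that are also in the instantiated-type set.

```java
// Sketch of RTA-style resolution of a single virtual call site; the choice of
// instantiated-type set (whole program vs. flow-restricted) directly explains
// the diverging results on the previous slide.
import java.util.*;

public class RtaResolution {
    static Set<String> resolve(String declaredReceiverType,
                               String methodName,
                               Set<String> instantiatedTypes,
                               Map<String, Set<String>> subtypes,  // type -> all subtypes
                               Map<String, String> lookup) {       // "Type#method" -> implementing type
        Set<String> targets = new LinkedHashSet<>();
        for (String t : subtypes.getOrDefault(declaredReceiverType, Set.of())) {
            if (instantiatedTypes.contains(t)) {
                targets.add(lookup.get(t + "#" + methodName) + "." + methodName);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> subtypes = Map.of(
            "Collection", Set.of("LinkedList", "ArrayList", "Vector", "HashSet"));
        Map<String, String> lookup = Map.of(
            "LinkedList#add", "LinkedList", "ArrayList#add", "ArrayList",
            "Vector#add", "Vector", "HashSet#add", "HashSet");

        // With the whole-program set, c2.add(null) has four possible targets ...
        System.out.println(resolve("Collection", "add",
            Set.of("LinkedList", "ArrayList", "Vector", "HashSet"), subtypes, lookup));
        // ... with a flow-restricted set it has only two.
        System.out.println(resolve("Collection", "add",
            Set.of("ArrayList", "Vector"), subtypes, lookup));
    }
}
```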
  39.–45. Project-specific Evaluation (RQ3) • Soot supports CSR, but it is expensive • OPAL supports most features but has the smallest call graph • OPAL covers only 47 methods from Xalan (~0.3%) • Very few call sites have a huge impact
  46. Is It Worth Doing the Work Manually? (RQ4) • GOAL: get a reasonably sound call graph • JVM profiling and TamiFlex [3] serve as ground truth • Workflow: apply Judge, inspect the results, add entry points • We analyzed 10 reflective call sites and added 50 entry points; the manual analysis took roughly 90 minutes • The call graph then covered 91% of all methods contained in the profile and 121 of the 198 methods reported by TamiFlex
  [3] Bodden, Eric, et al. Taming Reflection: Static Analysis in the Presence of Reflection and Custom Class Loaders (2010).
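The slides do not show how the 50 entry points were registered; as one hedged way to do it with Soot (the class and method below are invented, and the surrounding Soot setup such as the classpath configuration is omitted), a reflectively invoked target flagged by Judge can be added to the scene’s entry points so that subsequent call-graph construction treats it as reachable:

```java
// Sketch only: adding a reflectively invoked method as an extra entry point
// in Soot; this is not necessarily the workflow used on the slide.
import soot.Scene;
import soot.SootMethod;
import soot.options.Options;

import java.util.ArrayList;
import java.util.List;

public class AddEntryPoints {
    public static void main(String[] args) {
        Options.v().set_whole_program(true);  // whole-program call-graph mode
        Scene.v().loadNecessaryClasses();     // after configuring the Soot classpath

        List<SootMethod> entryPoints = new ArrayList<>(Scene.v().getEntryPoints());
        // A target that is only reached via Method.invoke in the application:
        SootMethod reflectiveTarget =
            Scene.v().getSootClass("com.example.PluginImpl").getMethodByName("activate");
        entryPoints.add(reflectiveTarget);
        Scene.v().setEntryPoints(entryPoints); // subsequent CG construction also starts here
    }
}
```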
