Hermes: Towards Representative Benchmarks
Collaborative work by many PhDs and Post Docs!
Dr. Michael Eichberg (mail@michael-eichberg.de)

Technische Universität Darmstadt
http://www.opal-project.de/Hermes.html
https://twitter.com/@TheOpalProject
https://gitter.im/OPAL-Project/Hermes
OOPSLA 2000
“The benchmarks in our evaluation are meant to be representative of real applications and they include four SPECjvm benchmarks plus three other large, object-oriented benchmarks.”
Empirical Software Engineering 2011
“ For our study, we consider the API
documentation of packages from the
official Java Development Kit, the Eclipse
project, and from the Apache foundation.
[…] By choosing libraries that cover a wide
range of use cases we try to alleviate the
risk that the study is biased towards a
particular domain. The decision is based on
the evidential observation that the level
and the quality of their documentation is
often high.”
FSE 2015
“ In the third research question, we looked at
the performance […] on realistic
programs. We selected five realistic apps
from different sources.”
[Bar chart: size/execution time of the analysis for the five apps Reflection, BookStore, a2z, add.me.fast, and pipefilter]
Why?
FSE 2019
“We evaluate IT on three well-known benchmark suites — DaCapo 2006, DaCapo-9.12-MR1-bach, and ScalaBench. Additionally, we use two real-world performance bug datasets. […]”
Qualitas Corpus,

XCorpus,

…

It’s huge, but…

is it representative?
[Slide shows the full list of the roughly 100 Qualitas Corpus projects, from ant, antlr, and aspectj to struts and tapestry; the extracted text is garbled.]
Properties of the Qualitas Corpus (Sept. 2013)
• No usage of JavaFX (available since 2008)

• Only one project (Hibernate) made use of Java 7 features

• (There is a lot of redundancy.)
Feature-oriented
Benchmarks
• Securibench

• Pointerbench

• …
Features are
not distributed
uniformly!
Judge: Identifying,
Understanding, and
Evaluating Sources of
Unsoundness in Call
Graphs
Michael Reif, Florian Kübler, Michael
Eichberg, Dominik Helm, Mira Mezini

ISSTA’19
WALA 0-1-CFA — all configured with the FULL reflection option. WALA requires specifying packages to be excluded from the analysis. For the comparative analysis (Experiment 2 (see 4.3)) we excluded no package, whereas for the experiment related to RQ3 we use the predefined Java60RegressionExclusions to ensure termination. For all Soot call graphs (SootCHA, SootRTA, SootVTA, and SootSPARK [26]) we use the options safe-forname and safe-newinstance. These options make Soot consider all types as instantiated when Class.forName or Class.newInstance is used. We could not use types-for-invoke due to exceptions being thrown [41]. Furthermore, we use include-all to ensure that no packages are filtered. Our library test cases are additionally started with library:signature-resolution and all-reachable to make use of Soot’s capabilities to analyze library code. DOOPCI’s call graph is set to be context-insensitive with classical-reflection turned on. For OPALRTA, we use the standard configuration.
All test cases w.r.t. libraries are started with the respective library
entry points. We perform all experiments on a server with two Intel
Xeon E5-2620 CPUs and 64 GB RAM.
4.2 Experiment 1
Our corpus for analyzing the prevalence of language and API features (RQ1) includes the XCorpus [13], the top 50 libraries from Maven Central [31] (from July 2018), the top 15 apps from Google’s Playstore (from January 2018), plus five Clojure [20], Groovy [32], Kotlin [16], and Scala [24] projects.
Table 2 visualizes the results using a heatmap. It shows the relative frequency of each feature (cf. Feature column) within each corpus. We include the OpenJDK column as a separate corpus because most corpus projects are built upon it and, hence, partially use its features. A feature’s relative frequency is color coded using a logarithmic scale as shown in the legend of Table 2. Slightly yellow boxes (■) identify unused features and red boxes (■) those found in ≥ 5% of all methods; we chose 5% because only 7 features occur in more than 5% of all methods. Features used in no corpus (e.g., Groovy invokedynamics, or the serialization of lambdas) and always soundly resolved features (e.g., standard poly-/monomorphic call) are not included.
■ All the API and language features supported by Java up to version 7 are used widely across all code bases.
The most frequently used feature that was introduced with Java 8 is the call of static interface methods (J8DIM6). 12% of all methods of the top 50 Maven projects use them; Scalatest [22] is responsible for ≈ 90% of all uses. Clojure and Android code have not yet adopted Java 8 call semantics. Other Java 8 features, e.g., MethodHandle constants, are rarely used; primarily by the Nashorn library.
■ Support for Java 8 is a must, given the frequent use of Java 8 call semantics features in modern code (J8DIMX), unless one analyzes only Android or Clojure code.
Serialization-related functionality (Ser3-7,9, ExtSer) and Java’s Reflection API (cf. TR, LRR, CSR) are both used with medium frequency. […]
[Table 2 legend and feature labels: relative frequencies color coded on a log2 scale from 0% to 5%; corpora: OpenJDK8, XCorpus, Top50Maven, Scala, Groovy, Kotlin, Clojure, Android; features: J8DIM1-6, JVMC1-5, Lambda1-7, LIB3-5, MR1-7, Native, Ser1-9, ExtSer1-3, SI1-8, MH1.1-8, TR1-9, LRR1-3, CSR1-3, Unsafe1-7]
Getting Representative and Relevant Benchmarks
1. Determine the population
2. Get a (manageable) representative subset
2.1. determine the relevant technical features
2.2. compute the subset…
2.2.1. for development and testing purposes
2.2.2. for a comprehensive evaluation
3. Make it publicly available!
Hermes
Assessment and Creation of Effective Test Corpora

(consisting of projects available as Java bytecode)
Facilitating the
Development and Evaluation of
Code Analyses
Idea → Prototyping → Development → Evaluation → Paper
OPAL
Hermes in a Nutshell
Corpus candidates → Hermes → Optimal corpus
Feature Queries
Manual and/or
Automatic Selection
using Choco Solver
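The selection step can be viewed as a covering problem: pick a small set of candidate projects whose combined features cover everything the whole candidate set exhibits. Hermes solves this with the Choco constraint solver; the greedy sketch below (plain Scala, with made-up project and feature names) only illustrates the underlying idea, not the actual implementation:

```scala
// Greedy sketch of corpus selection as set cover. Hermes itself uses
// the Choco constraint solver; project and feature names here are
// hypothetical.
object CorpusSelection {

  /** Repeatedly picks the project covering the most still-uncovered
    * features until every feature of the candidate set is covered. */
  def selectCorpus(candidates: Map[String, Set[String]]): List[String] = {
    var uncovered = candidates.values.flatten.toSet
    var chosen = List.empty[String]
    while (uncovered.nonEmpty) {
      // The project contributing the most new features wins this round.
      val (best, features) =
        candidates.maxBy { case (_, fs) => (fs & uncovered).size }
      chosen ::= best
      uncovered --= features
    }
    chosen.reverse
  }
}

val candidates = Map(
  "ant"     -> Set("Lambda1"),
  "aspectj" -> Set("Lambda1", "TR1"),
  "lucene"  -> Set("Ser1")
)
val corpus = CorpusSelection.selectCorpus(candidates)
// → List("aspectj", "lucene"): "ant" is redundant, since "aspectj"
//   already exhibits Lambda1.
```

Unlike this greedy approximation, a constraint solver can return a provably minimal subset, which is why Hermes delegates the actual selection to Choco.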
General Purpose Queries
• Existence of Bytecode Instructions
• Class File Versions
• Class Types
• Trivial Reflection
• OO Metrics (Fan-In/Fan-Out, …)
• Field Access
• Method w/o Returns
• Call Graph Related
• Method Types
• Recursive Data Structures
• Size of Inheritance Tree
• API Usage
Feature Queries Relevant for Constructing Call Graphs
• Static Initializer
• Lambdas
• Unsafe API
• Polymorphic Calls
• Method Reference
• Serialization
• Java 8 Polymorphic Calls
• Signature Polymorphic Methods
• Non-Java Bytecode
It’s Extensible!
Feature Queries for API Usage
• Bytecode Instrumentation
• Class Loader
• GUI
• Crypto
• JDBC
• Reflection
• System
• Thread
• Unsafe
It’s Extensible!
Feature Queries
trait FeatureQuery {
  // …
  def apply[S](
    projectConfiguration: ProjectConfiguration,
    project: Project[S],
    rawClassFiles: Traversable[(da.ClassFile, S)]
  ): TraversableOnce[Feature[S]]
  // …
}
Identifier, Project JAR Files, Library JAR Files, Statistics
Complete reified project information (classes, fields, methods, bodies, etc.)
Raw class file information (e.g., for extracting information from the constant pool)
List of detected features in the codebase (id, frequency of occurrence, (opt.) locations)
Feature
abstract case class Feature[S] private (
  id: String,
  count: Int,
  extensions: Chain[Location[S]]
) {
  …
}
The name of the
feature.
How often the feature
was found.
Where the feature was
found.
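Putting the two slides together: a feature query is essentially a function from project code to a list of Feature values. The stand-alone sketch below replaces OPAL's Project, da.ClassFile, and Chain types with plain Scala collections and a toy method model, so the names and shapes here are illustrative, not the real Hermes API:

```scala
// Simplified stand-ins for the Hermes types (the real API uses
// Project[S], da.ClassFile, and Chain[Location[S]]).
final case class Feature[S](id: String, count: Int, extensions: List[S])
final case class Method(name: String, opcodes: List[String])

/** A toy feature query: which methods contain an invokedynamic? */
def invokedynamicUsage(methods: List[Method]): Feature[String] = {
  val hits = methods.filter(_.opcodes.contains("invokedynamic")).map(_.name)
  // id = feature name, count = how often it was found,
  // extensions = where it was found.
  Feature("InvokedynamicUsage", hits.size, hits)
}

val methods = List(
  Method("plain",  List("aload_0", "invokespecial", "return")),
  Method("lambda", List("invokedynamic", "areturn"))
)
val feature = invokedynamicUsage(methods)
// → Feature("InvokedynamicUsage", 1, List("lambda"))
```

In Hermes, each such query runs over all corpus candidates, and the per-project feature counts feed the selection step shown earlier.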
Hermes - Example
Constructing a Minimal Corpus
• Hidden Truths in Dead Software Paths @ FSE 15

• Original evaluation and development conducted on the complete Qualitas Corpus

• The minimal corpus computed by Hermes using all available queries consists of only 5 of the 100 projects in the Qualitas Corpus

• Evaluation cut down from 16.77 minutes to 2.82 minutes (~6x faster) while coverage is only 1.06% below the original corpus
• Can help you to build a specifically targeted benchmark

• Hermes Data (i.e., all current feature queries) for all of
Maven Central

• For details talk to Ben :-)
Delphi
