Applying Compiler Techniques to Iterate at Blazing Speed
    @pascallouis
    julien@kaching.com
Engineering at kaChing
    TDD from day one
    Full regression suite runs in less than 3 minutes
    Deploy to production 30+ times a day
    People have written and launched new features during the interview process
Agenda
    Apply Compiler Techniques
    ?
    Profit!
Seriously… (Java focused)
    Software Analysis
        Anatomy of a compiler
        Creating meta tests
    Leveraging Types
        Levels of interpretation
        Descriptors and signatures
    DRY your code (fewer bugs, greater reach for experts, higher testability)
Software Analysis
    Run a series of analyses on the code base.
    Catch common mistakes due to distracted developers, new hires or bad APIs.
Anatomy of a Compiler
    Source Code
      -> Lexical Analysis -> Tokens
      -> Syntactic Analysis -> Abstract Syntax Tree
      -> Semantic Analysis -> Annotated Abstract Syntax Tree
      -> Intermediate Representation Generation -> Intermediate Representation
      -> Optimization -> Optimized Intermediate Representation
      -> Machine Code Generation -> Target Code
int x 1; int y x + 2;Lexical AnalysisIDENT(int) IDENT(x) ASGN NUMBER(1) SEMICOLONIDENT(int) IDENT(y) ASGN IDENT(x) PLUS NUMBER(2) SEMICOLON
Syntactic Analysis
    IDENT(int) IDENT(x) ASGN NUMBER(1) SEMICOLON
    IDENT(int) IDENT(y) ASGN IDENT(x) PLUS NUMBER(2) SEMICOLON
    becomes
    PROGRAM(
      LET(x, int, 1),
      LET(y, int, PLUS(x, 2)))
Semantic Analysis: Symbols
    PROGRAM(
      LET(x, int, 1),
      LET(y, int, PLUS(x, 2)))
Semantic Analysis: Types
    PROGRAM(
      LET(x, int, 1),
      LET(y, int, PLUS(x, 2)))
    becomes
    PROGRAM(
      LET(x: int, int, 1: int),
      LET(y: int, int, PLUS(x: int, 2: int): int))
Optimizations
    Simple optimizations can be done on the Abstract Syntax Tree
    Other optimizations require specialized representations
        Static Single Assignment form
        Control Flow Graph
Constant folding
    PROGRAM(
      LET(x: int, int, 1: int),
      LET(y: int, int, PLUS(3: int, 2: int): int))
    becomes
    PROGRAM(
      LET(x: int, int, 1: int),
      LET(y: int, int, 5: int))
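Constant folding is one of the optimizations that works directly on the syntax tree. A minimal sketch follows, assuming a hypothetical three-node expression tree (`Num`, `Var`, `Plus`) that is not taken from the talk:

```java
class ConstantFolding {
    interface Expr {}

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    static final class Plus implements Expr {
        final Expr left;
        final Expr right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "PLUS(" + left + ", " + right + ")"; }
    }

    /** Folds bottom-up: a PLUS whose operands are both constants becomes a constant. */
    static Expr fold(Expr e) {
        if (!(e instanceof Plus)) {
            return e; // constants and variables are already as simple as possible
        }
        Plus p = (Plus) e;
        Expr l = fold(p.left);
        Expr r = fold(p.right);
        if (l instanceof Num && r instanceof Num) {
            return new Num(((Num) l).value + ((Num) r).value);
        }
        return new Plus(l, r);
    }
}
```

Folding `PLUS(3, 2)` gives `5`, as on the slide, while `PLUS(x, PLUS(3, 2))` only folds the inner node and stays `PLUS(x, 5)` because `x` is not a constant at this level.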
int x1 (a1 + b1) / c1; int y1(a1+ b1) / d1;Common sub-expression eliminationint temp  a + b;int x  temp / d;inty  temp / d;
int x1 (a1 + b1) / c1;a2a1 + 1;int y1(a2+ b1) / d1;Common sub-expression eliminationintx (a + b) / c;a  a + 1;inty (a + b) / d;
Anatomy of a Compiler
    Lexical Analysis
    Syntactic Analysis
    Semantic Analysis
    Intermediate Representation Generation
    Optimization
    Machine Code Generation
javac
    Source Code -> Target Code
PMD
    Source Code -> Annotated Abstract Syntax Tree
joeq
    Source Code -> Intermediate Representation
scalac
    Lexical Analysis
    Syntactic Analysis
    Semantic Analysis
    Intermediate Representation Generation
    Optimization
    Machine Code Generation
Finding bad code snippets
    Describe bad code snippets using regular expressions.
    The analysis is done on the source code, before lexical analysis, because information such as whitespace is lost afterwards.
    Extremely easy to implement.
@CodeSnippets({
    @Check(paths = {"src", "srctest"}, snippets = {
        @Snippet("\\bif\\("),
        @Snippet("super\\(\\)")
    }),
    @Check(paths = {"srctest"}, snippets = {
        @Snippet("@Ignore\\s ")
    })
})
@RunWith(BadCodeSnippetsRunner.class)
public class BadCodeSnippetsTest {}

for (Snippet s : snippets) {
    if (patterns.get(s).matcher(line).find()) {
        uses.get(s).add(file);
    }
}
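The matching loop above needs nothing beyond `java.util.regex`. Here is a self-contained sketch of the same idea; the class `BadSnippetScan` and its method are hypothetical, and unlike the test runner on the slides it takes lines in memory and reports line numbers rather than files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class BadSnippetScan {
    /** Returns, for each regular expression, the 1-based numbers of the lines it matches. */
    static Map<String, List<Integer>> scan(List<String> lines, List<String> snippets) {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        Map<String, List<Integer>> uses = new LinkedHashMap<>();
        for (String s : snippets) {
            patterns.put(s, Pattern.compile(s)); // compile once, reuse for every line
            uses.put(s, new ArrayList<>());
        }
        for (int i = 0; i < lines.size(); i++) {
            for (String s : snippets) {
                if (patterns.get(s).matcher(lines.get(i)).find()) {
                    uses.get(s).add(i + 1);
                }
            }
        }
        return uses;
    }
}
```

With the snippet `\bif\(`, the line `if(x) {` is reported but `if (y) {` is not, which is the kind of whitespace rule that can only be checked on raw source text.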
Forbidden Calls
    scala> BigDecimal.valueOf(1).equals(BigDecimal.valueOf(1.0))
    res1: Boolean = false
    scala> BigDecimal.valueOf(1).compareTo(BigDecimal.valueOf(1.0)) == 0
    res2: Boolean = true
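The same behavior can be checked from plain Java. `BigDecimal.equals` compares both value and scale, whereas `compareTo` compares the numeric value only. The helper name `sameValue` below is an illustrative assumption, not an API from the talk.

```java
import java.math.BigDecimal;

class BigDecimalEquality {
    /** Numeric comparison that ignores scale, the replacement for equals(). */
    static boolean sameValue(BigDecimal a, BigDecimal b) {
        return a.compareTo(b) == 0;
    }

    public static void main(String[] args) {
        BigDecimal one = BigDecimal.valueOf(1);            // unscaled value 1, scale 0
        BigDecimal onePointZero = BigDecimal.valueOf(1.0); // unscaled value 10, scale 1
        System.out.println(one.equals(onePointZero));      // false: scales differ
        System.out.println(sameValue(one, onePointZero));  // true: same numeric value
    }
}
```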
@ForbiddenCalls({
    @Check(paths = {"bin"}, forbiddenMethods = {
        "java.math.BigDecimal#equals(java.lang.Object)",
    })
})
@RunWith(ForbiddenCallsTestRunner.class)
public class ForbiddenCallsTest {}
Finding Forbidden Calls
    Must be done after the typed AST is created.
    a.equals(b)
void doStuff(BigDecimal a, BigDecimal b) {
    boolean c = a.equals(b);
}
Code:
   0:   aload_1
   1:   aload_2
   2:   invokevirtual   #2;
   5:   istore_3
   6:   return
Bytes of the instruction at offset 2:
   2:   0xb6
   3:   0x00
   4:   0x02
Constant pool entries reached from #2:
   const #2  = Method        #14.#15;
   const #14 = class         #18;
   const #15 = NameAndType   #19:#20;
   const #18 = Asciz         java/math/BigDecimal;
   const #19 = Asciz         equals;
   const #20 = Asciz         (Ljava/lang/Object;)Z;
With the constant pool reference resolved:
   2:   invokevirtual   java/math/BigDecimal.equals(Ljava/lang/Object;)Z
ClassFile {
    u4 magic;
    u2 minor_version;
    u2 major_version;
    u2 constant_pool_count;
    cp_info constant_pool[constant_pool_count-1];
    u2 access_flags;
    u2 this_class;
    u2 super_class;
    u2 interfaces_count;
    u2 interfaces[interfaces_count];
    u2 fields_count;
    field_info fields[fields_count];
    u2 methods_count;
    method_info methods[methods_count];
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}
constant_pool holds entries such as:
    const #18 = Asciz   java/math/BigDecimal;
    const #19 = Asciz   equals;
    const #20 = Asciz   (Ljava/lang/Object;)Z;
method_info {
    u2 access_flags;
    u2 name_index;
    u2 descriptor_index;
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}
Code_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u2 max_stack;
    u2 max_locals;
    u4 code_length;
    u1 code[code_length];
    u2 exception_table_length;
    {
        u2 start_pc;
        u2 end_pc;
        u2 handler_pc;
        u2 catch_type;
    } exception_table[exception_table_length];
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}
code[code_length] holds the instructions:
    0:  aload_1
    1:  aload_2
    2:  invokevirtual   #2;
    5:  istore_3
    6:  return
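To make the layout above concrete, the sketch below reads the first fields of a class file with nothing but `java.io` and resolves every Methodref entry of the constant pool into the `owner.name descriptor` form shown earlier. The class name `ConstantPoolPeek` is an assumption, the tag numbers come from the class file specification, and this is a teaching aid rather than a substitute for ASM: it stops after the constant pool and never looks at the bytecode.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ConstantPoolPeek {
    /** Lists the Methodref entries of c's class file, e.g. java/lang/Foo.bar()V. */
    static List<String> methodRefs(Class<?> c) {
        String resource = "/" + c.getName().replace('.', '/') + ".class";
        try (InputStream raw = c.getResourceAsStream(resource)) {
            if (raw == null) {
                throw new IllegalArgumentException("no class file found for " + c);
            }
            return methodRefs(new DataInputStream(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> methodRefs(DataInputStream in) throws IOException {
        if (in.readInt() != 0xCAFEBABE) {           // u4 magic
            throw new IOException("not a class file");
        }
        in.readUnsignedShort();                     // u2 minor_version
        in.readUnsignedShort();                     // u2 major_version
        int count = in.readUnsignedShort();         // u2 constant_pool_count
        int[] tags = new int[count];
        Object[] pool = new Object[count];          // String for Utf8, int[] for index entries
        for (int i = 1; i < count; i++) {           // entry 0 is unused
            int tag = tags[i] = in.readUnsignedByte();
            switch (tag) {
                case 1:                             // Utf8: u2 length + modified UTF-8 bytes
                    pool[i] = in.readUTF();
                    break;
                case 3: case 4:                     // Integer, Float
                    in.readInt();
                    break;
                case 5: case 6:                     // Long, Double occupy two slots
                    in.readLong();
                    i++;
                    break;
                case 7: case 8: case 16: case 19: case 20: // one u2 index
                    pool[i] = new int[] { in.readUnsignedShort() };
                    break;
                case 9: case 10: case 11: case 12: case 17: case 18: // two u2 indexes
                    pool[i] = new int[] { in.readUnsignedShort(), in.readUnsignedShort() };
                    break;
                case 15:                            // MethodHandle: u1 kind + u2 index
                    in.readUnsignedByte();
                    in.readUnsignedShort();
                    break;
                default:
                    throw new IOException("unknown constant pool tag " + tag);
            }
        }
        List<String> refs = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            if (tags[i] != 10) continue;            // 10 = Methodref
            int[] ref = (int[]) pool[i];            // class_index, name_and_type_index
            int[] owner = (int[]) pool[ref[0]];     // Class -> name_index
            int[] nameAndType = (int[]) pool[ref[1]]; // name_index, descriptor_index
            refs.add(pool[owner[0]] + "." + pool[nameAndType[0]] + pool[nameAndType[1]]);
        }
        return refs;
    }
}
```

`ConstantPoolPeek.methodRefs(SomeClass.class)` returns every method the class refers to; a forbidden-calls check additionally has to walk the `code` arrays to learn which method contains each call and on which line.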
ASM
    @Override
    public void visitMethodInsn(
            int opcode,            // 0xb6
            String owner,          // "java/math/BigDecimal"
            String name,           // "equals"
            String descriptor) {   // "(Ljava/lang/Object;)Z"
        ….
    }
Failed Assertion
    junit.framework.AssertionFailedError:
    com.kaching.trading.core.Trade#execute()
       calls java.math.BigDecimal#equals(java.lang.Object)
       on line 273
Visibility Test
    class Lists {
        …
        @VisibleForTesting
        static int computeArrayListCapacity(int size) {
            return (int) Math.min(5L + size + (size / 10), Integer.MAX_VALUE);
        }
    }
Visibility Test
    class QuoteHttpClient implements QuoteClient {
        @Inject HttpClient client;
        Quote getQuote(Symbol<?> symbol) {
            return …;
        }
    }
@Visibilities({
    @Check(paths = {"bin"}, visibilities = {
        @Visibility(value = VisibleForTesting.class, intent = PRIVATE),
        @Visibility(value = Inject.class, intent = PRIVATE)
    })
})
@RunWith(VisibilityTestRunner.class)
public class VisibilityTest {}
Two Passes
    Find all classes, fields and methods annotated with the specified annotations.
    Find all instructions referring to these classes, fields and methods.
ASM
    @Override
    public AnnotationVisitor visitAnnotation(
            String descriptor,    // "Lcom/google/common/annotations/VisibleForTesting;"
            boolean visible) {    // false
        …
    }
boolean isVisibleBy(ParsedElement location, ParsedClass currentClass) {
    Annotation annotation = annotations.get(location);
    if (annotation != null) {
        …
    } else {
        return true; // let's trust the compiler :)
    }
}
Failed Assertion
    junit.framework.AssertionFailedError:
    com.kaching.account.ApplicationUploader#upload()
       refers to @VisibleForTesting method
       com.kaching.account.Customer#getState() on line 149
Java 5+ Type System
    Primitives
    Objects
    Generics
    Wildcards
    Intersection types
Erasure
    Object eraser() {
        return new ArrayList<String>();
    }
    Object obj = eraser();
    // impossible to recover that obj is a list of strings
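A short check of what survives erasure at run time; the class name `ErasureDemo` is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {
    static Object eraser() {
        return new ArrayList<String>();
    }

    /** Lists with different type arguments share one runtime class. */
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> integers = new ArrayList<>();
        return strings.getClass() == integers.getClass();
    }

    /** The object only knows its declared type variable, not the argument String. */
    static String whatTheObjectKnows() {
        return eraser().getClass().getTypeParameters()[0].getName();
    }
}
```

`sameRuntimeClass()` is true, and `whatTheObjectKnows()` returns the variable name `E` declared by `ArrayList<E>`. The argument `String` is not stored in the instance; it can only be recovered from declarations (fields, method signatures, supertypes), which is what the following slides rely on.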
Compiled classes
    $ cat MustBeSerializable.java
    import java.io.Serializable;
    interface MustBeSerializable<T extends Serializable> {}
    $ cat ExtendsMustBeSerializable.java
    class Value {}
    class ExtendsMustBeSerializable implements MustBeSerializable<Value> {}
Compiled classes
    $ javac MustBeSerializable.java
    $ rm MustBeSerializable.java
    $ ls
    MustBeSerializable.class ExtendsMustBeSerializable.java
    $ javac ExtendsMustBeSerializable.java -cp .
    ExtendsMustBeSerializable.java:2: type parameter Value is not within its bound
    class ExtendsMustBeSerializable implements MustBeSerializable<Value> {}
                                                                  ^
    1 error
Compiled classes
    The compiler must write type information into the class file for proper semantics
    When compiling other classes, it needs to read that type information and check against those contracts
Taking a peek at classes
    $ javap -v MustBeSerializable | grep -A 1 'Signature;'
    const #3 = Asciz        Signature;
    const #4 = Asciz        <T::Ljava/io/Serializable;>Ljava/lang/Object;;
Signatures
Signatures
    Primitives
        B for byte, C for char, D for double, …
    Objects
        Lclassname; such as Ljava/lang/String;
    Arrays
        [B for byte[], [[D for double[][]
    Void
        V
    … 8 pages of documentation
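The JDK can print these encodings itself, which is a handy way to check one's reading of the grammar. The sketch uses `java.lang.invoke.MethodType` for method descriptors and `Class.getName()` for array classes; `DescriptorDemo` is a hypothetical name.

```java
import java.lang.invoke.MethodType;

class DescriptorDemo {
    /** Descriptor of boolean equals(Object), as stored in the constant pool. */
    static String equalsDescriptor() {
        return MethodType.methodType(boolean.class, Object.class).toMethodDescriptorString();
    }

    /** Descriptor of a method taking no arguments and returning void. */
    static String voidDescriptor() {
        return MethodType.methodType(void.class).toMethodDescriptorString();
    }

    /** For array classes, getName() uses the same [ and primitive-letter encoding. */
    static String arrayName(Class<?> arrayClass) {
        return arrayClass.getName();
    }
}
```

`equalsDescriptor()` returns `(Ljava/lang/Object;)Z`, `voidDescriptor()` returns `()V`, and `arrayName(double[][].class)` returns `[[D`.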
With ASM
    org.objectweb.asm.signature.*
    org.objectweb.asm.Type
With Reflection
    java.lang.reflect.Type
        Class
        GenericArrayType(Type component)
        ParameterizedType(Type raw, Type[] arguments)
        TypeVariable(Type[] bounds, String name)
        WildcardType(Type[] lowerBounds, Type[] upperBounds)
Some Examples
    String.class
        Class<String>
    List<Integer>
        ParameterizedType(List.class, Integer.class)
    List<int[]>
        ParameterizedType(List.class, GenericArrayType(int.class))
Some Examples
    Map<? extends Shape, ? super Area>
        ParameterizedType(Map.class, {
            WildcardType({}, Shape.class),
            WildcardType(Area.class, {})
        })
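These structures can be observed through `java.lang.reflect` by asking a field for its generic type. In the sketch below, `Shape`, `Area` and the two fields are stand-ins declared only so that reflection has something to inspect; none of them come from the talk's code base.

```java
import java.lang.reflect.ParameterizedType;
import java.util.List;
import java.util.Map;

class Shape {}
class Area {}

class ReflectTypesDemo {
    // Fields exist only so that reflection has generic declarations to read.
    static List<Integer> integers;
    static Map<? extends Shape, ? super Area> bounded;

    /** Returns the generic type of one of the fields above. */
    static ParameterizedType typeOfField(String name) {
        try {
            return (ParameterizedType) ReflectTypesDemo.class
                .getDeclaredField(name)
                .getGenericType();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

`typeOfField("integers")` has raw type `List.class` and a single argument `Integer.class`. For `typeOfField("bounded")`, both arguments are `WildcardType` instances: the first has upper bound `Shape.class` and no lower bounds, the second has lower bound `Area.class`. One detail differs from the simplified notation on the slide: reflection reports `Object.class` as the implicit upper bound of `? super Area` rather than an empty array.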
With Reflection
    java.lang.Class
        getGenericSuperclass()
        getGenericInterfaces()
Concrete Examples
    Unification
    Just-In-Time Providers
Unification
    MyClass implements Callable<String> { …
    ((ParameterizedType) MyClass.class
        .getGenericInterfaces()[0])
        .getActualTypeArguments()[0]
    String.class!
Unification
    But if we have
        MyClass extends AbstractCallable<String> { …
        AbstractCallable<T> implements Callable<T> { …
    Unification.getActualTypeArgument(MyClass.class, Callable.class, 0);
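The indirect case requires following type variables up the hierarchy. The sketch below is a simplified stand-in and not kaChing's `Unification` class: `UnificationSketch` only substitutes a type variable that appears directly as a type argument, so it handles the `AbstractCallable<T> implements Callable<T>` example but not nested uses such as `OneTypeParam<Map<K, V>>`.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

abstract class AbstractCallable<T> implements Callable<T> {}

class MyClass extends AbstractCallable<String> {
    @Override public String call() { return "done"; }
}

class UnificationSketch {
    /** Finds the index-th type argument of target as seen from start, or null. */
    static Type getActualTypeArgument(Class<?> start, Class<?> target, int index) {
        return search(start, target, index, new HashMap<TypeVariable<?>, Type>());
    }

    // bindings maps the type variables of the subtype we came from to their arguments.
    private static Type search(Type type, Class<?> target, int index,
                               Map<TypeVariable<?>, Type> bindings) {
        Class<?> raw;
        Type[] resolved = new Type[0];
        Map<TypeVariable<?>, Type> next = new HashMap<>();
        if (type instanceof ParameterizedType) {
            ParameterizedType p = (ParameterizedType) type;
            raw = (Class<?>) p.getRawType();
            Type[] args = p.getActualTypeArguments();
            TypeVariable<?>[] vars = raw.getTypeParameters();
            resolved = new Type[args.length];
            for (int i = 0; i < args.length; i++) {
                // Replace a variable of the subtype by what the subtype bound it to.
                resolved[i] = bindings.containsKey(args[i]) ? bindings.get(args[i]) : args[i];
                next.put(vars[i], resolved[i]);
            }
        } else if (type instanceof Class) {
            raw = (Class<?>) type;
        } else {
            return null; // null supertype, or a kind of type this sketch ignores
        }
        if (raw == target) {
            return index < resolved.length ? resolved[index] : null;
        }
        Type found = search(raw.getGenericSuperclass(), target, index, next);
        for (Type itf : raw.getGenericInterfaces()) {
            if (found != null) break;
            found = search(itf, target, index, next);
        }
        return found;
    }
}
```

`UnificationSketch.getActualTypeArgument(MyClass.class, Callable.class, 0)` returns `String.class`: the binding `T = String` recorded at `AbstractCallable<String>` is substituted when `Callable<T>` is reached.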
Unification – Want to Try?
    class MergeOfIntegerAndString extends Merge<Integer, String> {}
    class Merge<K, V> implements OneTypeParam<Map<K, V>> {}
    interface OneTypeParam<T> {}
Guice
    class QuoteHttpClient implements QuoteClient {
        @Inject HttpClient client;
        Quote getQuote(Symbol<?> symbol) {
            return …;
        }
    }
Providers
    bind(Repository.class)
        .toProvider(new Provider<Repository>() {
            Repository get() {
                return new RepositoryImpl(…);
            }
        });
Tedious Providers
    bind(new TypeLiteral<Marshaller<User>>() {})
        .toProvider(new Provider<…>() {
            Marshaller<User> get() {
                return TwoLattes
                    .createMarshaller(User.class);
            }
        });
It gets tedious…
    @Inject Marshaller<User>
    @Inject Marshaller<Portfolio>
    @Inject Marshaller<Watchlist>
    ….
    Lots and lots and lots of bindings
Just-In-Time Providers
    TwoLattes.createMarshaller(Foo.class)
    bindJit(new TypeLiteral<Marshaller<?>>() {})
        .toProvider(new JitProvider<…>() {
            Marshaller<?> get(Key key) {
                return TwoLattes.createMarshaller(extractClassFromKey(key));
            }
        });
Just-In-Time Providers
    Pattern matching on types
    Marshaller<?> is a pattern for Marshaller<User>, Marshaller<Portfolio>, …
    Can be arbitrarily complex, including wildcards, intersection types etc.
    http://github.com/pascallouisperez/guice-jit-providers
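Matching a pattern such as `Marshaller<?>` against a concrete type can be sketched with plain reflection. Everything below is hypothetical and is not the implementation found in guice-jit-providers: `TypeRef` is a minimal super type token (the idiom behind Guice's `TypeLiteral`), `Marshaller` and `User` are stand-in types, and the matcher only understands wildcards whose upper bounds are plain classes.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.WildcardType;

// Super type token: an anonymous subclass records T in its generic superclass.
abstract class TypeRef<T> {
    final Type type =
        ((ParameterizedType) getClass().getGenericSuperclass()).getActualTypeArguments()[0];
}

interface Marshaller<T> {} // stand-in for the slide's Marshaller
class User {}              // stand-in for the slide's User

class TypePatterns {
    /** True if candidate is an instance of pattern, where wildcards match by bound. */
    static boolean matches(Type pattern, Type candidate) {
        if (pattern instanceof WildcardType) {
            WildcardType w = (WildcardType) pattern;
            if (w.getLowerBounds().length != 0) {
                return false; // "? super X" is out of scope for this sketch
            }
            for (Type bound : w.getUpperBounds()) {
                if (bound == Object.class) continue; // plain "?" accepts anything
                boolean ok = bound instanceof Class && candidate instanceof Class
                    && ((Class<?>) bound).isAssignableFrom((Class<?>) candidate);
                if (!ok) return false;
            }
            return true;
        }
        if (pattern instanceof ParameterizedType && candidate instanceof ParameterizedType) {
            ParameterizedType p = (ParameterizedType) pattern;
            ParameterizedType c = (ParameterizedType) candidate;
            if (!p.getRawType().equals(c.getRawType())) return false;
            Type[] pa = p.getActualTypeArguments();
            Type[] ca = c.getActualTypeArguments();
            for (int i = 0; i < pa.length; i++) {
                if (!matches(pa[i], ca[i])) return false;
            }
            return true;
        }
        return pattern.equals(candidate);
    }

    /** Plays the role of the slide's extractClassFromKey for a one-argument type. */
    static Class<?> firstArgument(Type type) {
        return (Class<?>) ((ParameterizedType) type).getActualTypeArguments()[0];
    }
}
```

With `Type pattern = new TypeRef<Marshaller<?>>() {}.type` and `Type user = new TypeRef<Marshaller<User>>() {}.type`, `matches(pattern, user)` is true and `firstArgument(user)` is `User.class`, which is the class a just-in-time provider would hand to the marshaller factory.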
Tools
    PMD http://pmd.sourceforge.net/
    Javassist http://www.csg.is.titech.ac.jp/~chiba/javassist/
    FindBugs http://findbugs.sourceforge.net/
    Joeq http://suif.stanford.edu/~courses/cs243/joeq/index.html
    ASM http://asm.ow2.org/
    Guice http://code.google.com/p/google-guice/
References
    JVM spec http://www.amazon.com/Java-Virtual-Machine-Specification-2nd/dp/0201432943
    Class File spec http://java.sun.com/docs/books/jvms/second_edition/ClassFileFormat-Java5.pdf
    Super Type Tokens http://gafter.blogspot.com/2006/12/super-type-tokens.html
    Unifying Type Parameters in Java http://eng.kaching.com/2009/10/unifying-type-parameters-in-java.html
    Type Safe Bit Fields Using Higher Kinded Types http://eng.kaching.com/2010/08/type-safe-bit-fields-using-higher.html

Editor's Notes

  • #7 A token is a string of characters, categorized according to the rules as a symbol.
  • #13 Abstract interpretation
  • #14 In compiler design, static single assignment form (often abbreviated as SSA form or simply SSA) is an intermediate representation (IR) in which every variable is assigned exactly once. Existing variables in the original IR are split into versions, new variables typically indicated by the original name with a subscript, so that every definition gets its own version.
  • #15 In functional language compilers, such as those for Scheme, ML and Haskell, continuation-passing style (CPS) is generally used where one might expect to find SSA in a compiler for Fortran or C. SSA is formally equivalent to a well-behaved subset of CPS
  • #16–#20 A token is a string of characters, categorized according to the rules as a symbol.
  • #29 Internal name of the method’s owner class, method’s name and method’s descriptor.
  • #45 Pretty tricky