Slides from the paper presented at the 2011 Working Conference on Mining Software Repositories (MSR 2011), by Julius Davies, Daniel German, Mike Godfrey, and Abram Hindle
Software Bertillonage: Finding the Provenance of an Entity
1. Software Bertillonage: Finding the Provenance of an Entity
   Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle
3. “Provenance”
   A set of documentary evidence pertaining to the origin, history, or ownership of an artifact.
   [From “provenir”, French for “to come from”]
   - Originally used for art / antiques, but now used in science and IT:
     - Data provenance / audit trails
     - Component provenance for security
   - … but what about source code artifacts?
5. Bertillonage metrics
   - Height
   - Stretch: length of body from left shoulder to right middle finger when the arm is raised
   - Bust: length of torso from head to seat, taken when seated
   - Length of head: crown to forehead
   - Width of head: temple to temple
   - Length of right ear
   - Length of left foot
   - Length of left middle finger
   - Length of left cubit: elbow to tip of middle finger
   - Width of cheeks
6. Forensic Bertillonage
   - “Quick and dirty”, and a giant step forward
   - Can narrow a huge set of candidate mugshots down to a small handful!
   - Some problems:
     - Equipment + training required
     - Very sensitive to measurement error
     - The metrics are not independent!
     - … and fingerprints are much more accurate
7. Software Bertillonage
   - We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
     - Who are you, really?
     - Where did you come from?
     - Does your mother know you’re here?
   - It’s not fingerprinting or DNA analysis! You may be looking for a cousin or ancestor
   - A good software Bertillonage metric should:
     - be computationally inexpensive
     - be applicable to the desired level of granularity / prog. language
     - catch most of the bad guys (recall)
     - significantly reduce the search space (precision)
10. A real-world problem
    - Software packages often bundle in third-party libraries to avoid “DLL hell” [Di Penta 10]
    - In the Java world, jars may include library source code or just byte code
    - Included libs may include other libs too!
    - Payment Card Industry Data Security Standard (PCI-DSS), Req. #6: “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.”
    - What if a financial software package doesn’t explicitly list the version IDs of its included libraries?
11. Identifying included libraries
    - The version ID may be embedded in the name of the component! e.g., commons-codec-1.1.jar
      - … but often the version info is simply not there, or it’s wrong, …
    - Use the fully qualified name (FQN) of each class plus a code search engine [Di Penta 10]
      - Will return only the product, not the version
    - Compare against all known compiled binaries
      - But compilers and build-time compilation options may differ
12. commons-codec-1.1.jar

    public class org.apache.commons.codec.binary.Base64
        implements BinaryEncoder, BinaryDecoder
      public <init>()
      private static boolean isBase64(byte)
      public static boolean isArrayByteBase64(byte[])
      public static byte[] encodeBase64(byte[])
      public static byte[] encodeBase64Chunked(byte[])
      public Object decode(Object) throws DecoderException
      public byte[] decode(byte[]) throws DecoderException
      public static byte[] encodeBase64(byte[],boolean)
      public static byte[] decodeBase64(byte[])
      static byte[] discardWhitespace(byte[])
      public Object encode(Object) throws EncoderException
      public byte[] encode(byte[]) throws EncoderException
13. commons-codec-1.2.jar

    public class org.apache.commons.codec.binary.Base64
        implements BinaryEncoder, BinaryDecoder
      public <init>()
      private static boolean isBase64(byte)
      public static boolean isArrayByteBase64(byte[])
      public static byte[] encodeBase64(byte[])
      public static byte[] encodeBase64Chunked(byte[])
      public Object decode(Object) throws DecoderException
      public byte[] decode(byte[])
      public static byte[] encodeBase64(byte[],boolean)
      public static byte[] decodeBase64(byte[])
      static byte[] discardWhitespace(byte[])
      static byte[] discardNonBase64(byte[])
      public Object encode(Object) throws EncoderException
      public byte[] encode(byte[])
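The two listings above differ only in a handful of method signatures. As a quick illustration (a minimal Python sketch; the signature strings are transcribed from the slides, abbreviated to just the entries that changed between the two releases), a plain set difference already exposes version-discriminating methods:

```python
# Method signatures of org.apache.commons.codec.binary.Base64,
# abbreviated to the entries that differ between commons-codec
# 1.1 and 1.2 as listed on the two slides above.
v1_1 = {
    "public byte[] decode(byte[]) throws DecoderException",
    "public byte[] encode(byte[]) throws EncoderException",
    "static byte[] discardWhitespace(byte[])",
}
v1_2 = {
    "public byte[] decode(byte[])",
    "public byte[] encode(byte[])",
    "static byte[] discardWhitespace(byte[])",
    "static byte[] discardNonBase64(byte[])",
}

# Signatures unique to one release discriminate between versions
# even when the jar's file name carries no version ID.
only_in_1_1 = v1_1 - v1_2
only_in_1_2 = v1_2 - v1_1
```

Here the added `discardNonBase64` method and the dropped `throws` clauses are enough to tell 1.1 from 1.2, which is exactly the kind of cheap discriminator Bertillonage looks for.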
14. Anchored class signatures
    - Idea: compile / acquire all known lib versions but extract only the signatures, then compare against the target binary
      - Shouldn’t vary by compiler / build settings
    - For a class C with methods M1, ..., Mn, we define its anchored class signature as:
        θ(C) = ⟨σ(C), ⟨σ(M1), ..., σ(Mn)⟩⟩
    - For an archive A composed of classes C1, ..., Ck, we define its anchored class signature as:
        θ(A) = {θ(C1), ..., θ(Ck)}
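These definitions translate directly into hashable values. A minimal Python sketch (assuming the class and method signatures have already been extracted as strings, as in the decompiled example on the next slide; the function names are hypothetical):

```python
def theta_class(class_sig, method_sigs):
    """Anchored class signature: theta(C) = <sigma(C), <sigma(M1), ..., sigma(Mn)>>.
    Method signatures are sorted so that extraction order cannot
    change the result."""
    return (class_sig, tuple(sorted(method_sigs)))

def theta_archive(classes):
    """Anchored archive signature: theta(A) = {theta(C1), ..., theta(Ck)},
    where `classes` is an iterable of (class_sig, method_sigs) pairs.
    Using a set makes duplicate classes collapse automatically."""
    return {theta_class(sig, methods) for sig, methods in classes}

# Example: the class from the decompiled listing on the next slide.
sig = theta_class(
    "public class a.b.C extends Object implements I",
    ["public C()", "default synchronized static int a(String) throws E"],
)
```

Because each θ(C) is an immutable tuple, archive signatures are plain Python sets and the set algebra of the similarity slide applies to them directly.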
15.
    // This is **decompiled** source!!
    package a.b;
    public class C extends java.lang.Object implements g.h.I {
      public C() {
        // default constructor is inserted by javac
      }
      synchronized static int a (java.lang.String s) throws a.b.E {
        // decompiled byte code omitted
      }
    }

    σ(C)  = public class a.b.C extends Object implements I
    σ(M1) = public C()
    σ(M2) = default synchronized static int a(String) throws E
    θ(C)  = ⟨σ(C), ⟨σ(M1), σ(M2)⟩⟩
16. Archive similarity
    - We define the similarity index of two archives A1 and A2 as their Jaccard coefficient:
        sim(A1, A2) = |θ(A1) ∩ θ(A2)| / |θ(A1) ∪ θ(A2)|
    - We define the inclusion index of archive A1 in archive A2 as:
        inclusion(A1, A2) = |θ(A1) ∩ θ(A2)| / |θ(A1)|
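A minimal Python sketch of both indices, operating on archive signatures represented as sets (the set elements are assumed to be the hashable anchored class signatures; the inclusion index measures how much of the first archive is contained in the second):

```python
def similarity(theta_a, theta_b):
    """Jaccard coefficient of two archive signatures:
    |theta(A1) & theta(A2)| / |theta(A1) | theta(A2)|."""
    if not theta_a and not theta_b:
        return 1.0  # two empty archives are trivially identical
    return len(theta_a & theta_b) / len(theta_a | theta_b)

def inclusion(theta_a, theta_b):
    """Fraction of A1's class signatures also present in A2:
    |theta(A1) & theta(A2)| / |theta(A1)|."""
    if not theta_a:
        return 0.0
    return len(theta_a & theta_b) / len(theta_a)

# A source archive that bundles extra test classes still fully
# *includes* the shipped binary, even though similarity < 1.0.
binary = {"C1", "C2", "C3"}
source = {"C1", "C2", "C3", "TestC1"}
```

This is why similarity alone is not enough: a source archive that also contains test classes which never ship in the final product can still fully include the binary, so `inclusion(binary, source)` stays at 1.0 while the Jaccard similarity drops below 1.0.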
18. Implementation
    - Created byte code (BCEL 5) and source code signature extractors
    - Used a SHA-1 hash for class signatures to improve performance
      - We don’t care about near misses at the method or class level!
    - Built a corpus from the Maven2 jar repository
      - Maven is unversioned + volatile!
      - 150 GB of jars, zips, tarballs, etc.
      - 130,000 binary jars (75,000 unique)
      - 26M .class files, 4M .java source files (incl. duplicates)
      - Archives contain archives: 75,000 classes are nested 4 levels deep!
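The SHA-1 step above can be sketched as follows (a minimal sketch, assuming each anchored class signature is first rendered as a canonical string; the helper name is hypothetical, not the paper's tool):

```python
import hashlib

def class_signature_hash(class_sig, method_sigs):
    """Collapse an anchored class signature into a fixed-size SHA-1
    digest. Comparison becomes exact-match only, which is the point:
    near misses at the method or class level are deliberately ignored,
    and comparing 40-character digests is much cheaper than comparing
    full signature tuples across a 150 GB corpus."""
    canonical = "\n".join([class_sig] + sorted(method_sigs))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Sorting the method signatures before hashing keeps the digest stable regardless of the order in which the extractor encountered the methods.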
20. Investigation
    - Target system: an industrial e-commerce app containing 84 jars
    - RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?
    - RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
21. Investigation
    RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?
    - 51 / 84 binary jars (60.7%): a single candidate from the corpus with similarity index of 1.0
      - 48 exact, 3 correct product (target version not in Maven)
    - 20 / 84: multiple matches with simIndex = 1.0
      - 19 exact, 1 correct product
    - 12 / 84: matches with 0 < simIndex < 1.0
      - 1 exact, 9 correct product, 2 incorrect product
    - 1 / 84: no match (product not in Maven as a binary)
    More data here: http://juliusdavies.ca/uvic/jarchive/
22. Further uses
    - Used the version info extracted from the e-commerce app to perform audits for licensing and security
      - One jar changed open source licenses (GNU Affero, LGPL)
      - One jar version was found to have known security bugs
    - When did Google Android developers copy-paste httpclient.jar classes into android.jar? And how much work would it be to include a newer version?
      - We narrowed it down to two likely candidates, one of which turned out to be correct.
23. Summary
    - Who are you? Determining the provenance of software entities is a growing and important problem
    - Software Bertillonage: quick and dirty techniques applied widely, then expensive techniques applied narrowly
    - Identifying version IDs of included Java libraries is an example of the software provenance problem
    - … and our solution of anchored signature matching is an example of software Bertillonage
24. Production babies (born the week the paper was due)
    - Uncle Mike and Lilia Biswas (who was born in Mike’s basement)
    - Dad Julius and Naoki
25. Software Bertillonage: Finding the Provenance of an Entity
    Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle
26. Investigation
    RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
    - 22 / 84 source jars (26.2%): a single candidate from the corpus with similarity index of 1.0
      - 13 exact, 2 correct product
    - 7 / 84: multiple matches with simIndex = 1.0
      - 6 exact, 1 correct product
    - 46 / 84: matches with 0 < simIndex < 1.0
      - 25 exact, 20 correct product, 1 incorrect product
    - 16 / 84: no match (product not in Maven as source)
    More data here: http://juliusdavies.ca/uvic/jarchive/
27. Investigation
    - RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?
      - Found exact match or correct product 81 / 84 times (96.4%)
    - RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
      - Found exact match or correct product 57 / 84 times (67.9%)
28. Summary: Anchored signature matching
    - If the master repository is rich enough, anchored signature matching can be highly accurate for both source and binary
      - Though we might have to examine multiple candidates with perfect scores (median: 4, max: 30)
    - Also works well if the product is present but the target version is missing
    - The approach is fast and simple, once the repository has been built
30. Forensic Bertillonage
    - Some problems …
      - Equipment was cumbersome, expensive, required training
      - Measurement error, consistency
      - The metrics were not independent!
      - Adoption (and later abandonment)
    - … but overall it was a big success!
      - Quick and dirty, and a huge leap forward
      - Some training and tools required, but could be performed with the technology of the late 1800s
      - If done accurately, could quickly narrow down a very large pool of mugshots to only a handful
31. Bertillonage desiderata
    - A good Bertillonage metric should:
      - be computationally inexpensive
      - be applicable to the desired level of granularity / prog. language
      - catch most of the bad guys (recall)
      - significantly reduce the search space (precision)
    - Why not “fingerprinting” or “DNA analysis”?
      - Often there just is not enough info (or too much noise) to make a conclusive identification
      - So we hope to reduce the candidate set so that manual examination is feasible
32. Software Bertillonage
    - We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
      - Who are you, really? Entity and relationship analysis
      - Where did you come from? Evolutionary history
      - Does your mother know you’re here? Licensing
Editor's Notes
Diana and Actaeon by Titian has a full provenance covering its passage through several owners and four countries since it was painted for Philip II of Spain in the 1550s. Currently, it sits in the National Gallery, London (when not on tour).
Invented a biometrics-based system for indexing criminal records. Also invented the mug shot, crime scene photography, forensic document analysis, …
You would think that the similarity index would be enough … but a lot of source archives contain test classes that don’t ship with the final product
Maven2 repository acts as the Java community’s de facto library archive. The repository was originally developed as a place from where the Maven build system could download required libraries to build and compile an application. Thanks to its broad coverage and depth, many competing java build systems and dependency resolvers currently make use of it.
For 51: 48 were “perfect” (correct product and version); 3 were the correct product, but Maven did not have that version. For 20: 19 contained a perfect match in the result set; for 1, Maven did not have that version. The perfect-match set was small for all but 2 or 3 cases, where a match set of over 30 candidates was produced.
For RQ1, when similarity index = 1.0, median number of cases to check was 4, max was 30
Bertillon noticed that while everyone has eyes, nose, mouth, ears, etc, there are many variations on shapes and sizes. Perhaps this could be used as an organizing principle for mugshots. He also invented the modern mugshot, realizing that the combination of head-on and profile, done systematically, could aid in creating a useful photo repository.