Software Bertillonage: Finding the Provenance of an Entity
Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle
“Provenance”
  • Originally used for art / antiques, but now used in science and IT:
      • Data provenance / audit trails
      • Component provenance for security
  • … but what about source code artifacts?
  • A set of documentary evidence pertaining to the origin, history, or ownership of an artifact. [From “provenir”, French for “to come from”]
Who are you?
  • Alphonse Bertillon (1853-1914)
Bertillonage metrics
  • Height
  • Stretch: Length of body from left shoulder to right middle finger when arm is raised
  • Bust: Length of torso from head to seat, taken when seated
  • Length of head: Crown to forehead
  • Width of head: Temple to temple
  • Length of right ear
  • Length of left foot
  • Length of left middle finger
  • Length of left cubit: Elbow to tip of middle finger
  • Width of cheeks
Forensic Bertillonage
  • “Quick and dirty”, and a giant step forward
      • Can narrow a huge set of candidate mugshots down to a small handful!
  • Some problems:
      • Equipment + training required
      • Very sensitive to measurement error
      • The metrics are not independent!
      • … and fingerprints are much more accurate
Software Bertillonage
  • We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
      • Who are you, really?
      • Where did you come from?
      • Does your mother know you’re here?
  • It’s not fingerprinting or DNA analysis!
      • You may be looking for a cousin or ancestor
  • A good software Bertillonage metric should:
      • be computationally inexpensive
      • be applicable to the desired level of granularity / programming language
      • catch most of the bad guys (recall)
      • significantly reduce the search space (precision)
Bertillonage meta-techniques
Software Bertillonage: A motivating example
A real-world problem
  • Software packages often bundle in third-party libraries to avoid “DLL-hell” [Di Penta 10]
      • In the Java world, jars may include library source code or just byte code
      • Included libs may include other libs too!
  • Payment Card Industry Data Security Std (PCI-DSS), Req #6:
      “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.”
  • What if a financial software package doesn’t explicitly list the version IDs of its included libraries?
Identifying included libraries
  • The version ID may be embedded in the name of the component, e.g., commons-codec-1.1.jar
      • … but often the version info is simply not there, or it’s wrong, …
  • Use the fully qualified name (FQN) of each class plus a code search engine [Di Penta 10]
      • Will return only the product, not the version
  • Compare against all known compiled binaries
      • But compilers and build-time compilation options may differ
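The first heuristic on this slide, reading the version ID out of the file name, can be sketched in a few lines of Java. This is an illustration only: the class name `JarNameVersion` and the regular expression are assumptions, not something taken from the talk, and the result is only as trustworthy as the file name itself.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the talk): guesses a version ID from a jar file name.
class JarNameVersion {

    // Assumed naming convention: <product>-<digits>(.<digits>)*.jar,
    // e.g. commons-codec-1.1.jar. Real-world names are often messier.
    private static final Pattern NAME =
            Pattern.compile("^(.+)-(\\d+(?:\\.\\d+)*)\\.jar$");

    // Returns the version part of the name, or empty if the name carries none.
    static Optional<String> versionOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(versionOf("commons-codec-1.1.jar")); // Optional[1.1]
        System.out.println(versionOf("codec.jar"));             // Optional.empty
    }
}
```

A renamed or repackaged jar defeats this check entirely, which is why the slide moves on to content-based comparison.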
commons-codec-1.1.jar
    public class org.apache.commons.codec.binary.Base64
            implements BinaryEncoder, BinaryDecoder
        public <init>()
        private static boolean isBase64(byte)
        public static boolean isArrayByteBase64(byte[])
        public static byte[] encodeBase64(byte[])
        public static byte[] encodeBase64Chunked(byte[])
        public Object decode(Object) throws DecoderException
        public byte[] decode(byte[]) throws DecoderException
        public static byte[] encodeBase64(byte[],boolean)
        public static byte[] decodeBase64(byte[])
        static byte[] discardWhitespace(byte[])
        public Object encode(Object) throws EncoderException
        public byte[] encode(byte[]) throws EncoderException
commons-codec-1.2.jar
    public class org.apache.commons.codec.binary.Base64
            implements BinaryEncoder, BinaryDecoder
        public <init>()
        private static boolean isBase64(byte)
        public static boolean isArrayByteBase64(byte[])
        public static byte[] encodeBase64(byte[])
        public static byte[] encodeBase64Chunked(byte[])
        public Object decode(Object) throws DecoderException
        public byte[] decode(byte[])
        public static byte[] encodeBase64(byte[],boolean)
        public static byte[] decodeBase64(byte[])
        static byte[] discardWhitespace(byte[])
        static byte[] discardNonBase64(byte[])
        public Object encode(Object) throws EncoderException
        public byte[] encode(byte[])
Anchored class signatures
  • Idea: Compile / acquire all known lib versions but extract only the signatures, then compare against the target binary
      • Shouldn’t vary by compiler/build settings
  • For a class C with methods M1, …, Mn, we define its anchored class signature as:
        θ(C) = ⟨σ(C), ⟨σ(M1), ..., σ(Mn)⟩⟩
  • For an archive A composed of classes C1, …, Ck, we define its anchored class signature as:
        θ(A) = {θ(C1), ..., θ(Ck)}
    // This is **decompiled** source!!
    package a.b;

    public class C extends java.lang.Object
            implements g.h.I {

        public C() {
            // default constructor is inserted by javac
        }

        synchronized static int a (java.lang.String s)
                throws a.b.E {
            // decompiled byte code omitted
        }
    }

  σ(C)  = public class a.b.C extends Object implements I
  σ(M1) = public C()
  σ(M2) = default synchronized static int a(String) throws E
  θ(C)  = ⟨σ(C), ⟨σ(M1), σ(M2)⟩⟩
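To make the definition concrete, here is a rough sketch that builds something like θ(C) for a class already loaded in the JVM. It is not the extractor described in the talk: it uses `java.lang.reflect` rather than reading byte code, the name `AnchoredSignatureSketch` is hypothetical, and the exact string normalisation (fully qualified parameter types, sorted members) is an assumption made for illustration.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: approximates anchored class signatures via reflection.
class AnchoredSignatureSketch {

    // Toy types mirroring the slide's example (class C implements I, method a throws E).
    interface I {}
    static class E extends Exception {}
    static class C implements I {
        public C() {}
        static synchronized int a(String s) throws E { return 0; }
    }

    // sigma(C): modifiers, name, superclass and interfaces of a plain class.
    // Interfaces, enums and annotations would need extra cases.
    static String sigmaClass(Class<?> c) {
        String interfaces = Arrays.stream(c.getInterfaces())
                .map(Class::getSimpleName)
                .collect(Collectors.joining(", "));
        String superName = c.getSuperclass() == null
                ? "" : " extends " + c.getSuperclass().getSimpleName();
        return (Modifier.toString(c.getModifiers()) + " class " + c.getName()
                + superName
                + (interfaces.isEmpty() ? "" : " implements " + interfaces)).trim();
    }

    // <sigma(M1), ..., sigma(Mn)>: constructors and methods, skipping compiler-made
    // synthetic members. Sorting is an assumption: reflection gives no stable order.
    static List<String> sigmaMembers(Class<?> c) {
        List<String> sigs = new ArrayList<>();
        for (Constructor<?> k : c.getDeclaredConstructors()) {
            if (!k.isSynthetic()) sigs.add(k.toString());
        }
        for (Method m : c.getDeclaredMethods()) {
            if (!m.isSynthetic()) sigs.add(m.toString());
        }
        Collections.sort(sigs);
        return sigs;
    }

    // theta(C) rendered as text: sigma(C) on the first line, then one member per line.
    static String theta(Class<?> c) {
        return sigmaClass(c) + "\n" + String.join("\n", sigmaMembers(c));
    }

    public static void main(String[] args) {
        System.out.println(theta(C.class));
    }
}
```

Because only declarations are read, the output does not depend on how method bodies were compiled, which is the property the slide is after.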
Archive similarity
  • We define the similarity index of two archives A and B as the Jaccard coefficient of their signature sets:
        |θ(A) ∩ θ(B)| / |θ(A) ∪ θ(B)|
  • We define the inclusion index of two archives as:
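A small sketch of both indices over sets of signature strings. The similarity index follows directly from the Jaccard definition. The inclusion index is assumed here to be |θ(A) ∩ θ(B)| / |θ(A)|, i.e. the fraction of A's class signatures that also occur in B. That formula is an assumption made for this example, as are the class and method names and the choice to return 0 for empty inputs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two archive-level indices.
class ArchiveSimilarity {

    // Jaccard coefficient: |A ∩ B| / |A ∪ B|. Returns 0.0 for two empty sets (a convention).
    static <T> double similarityIndex(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // ASSUMPTION: inclusion index taken as |A ∩ B| / |A|. Returns 0.0 if A is empty.
    static <T> double inclusionIndex(Set<T> a, Set<T> b) {
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return a.isEmpty() ? 0.0 : (double) intersection.size() / a.size();
    }

    public static void main(String[] args) {
        // Placeholder signature hashes; real inputs would be the theta(C) values of each archive.
        Set<String> a = new HashSet<>(Arrays.asList("s1", "s2", "s3"));
        Set<String> b = new HashSet<>(Arrays.asList("s2", "s3", "s4", "s5"));
        System.out.println(similarityIndex(a, b)); // 2 shared out of 5 distinct
        System.out.println(inclusionIndex(a, b));  // 2 of A's 3 signatures appear in B
    }
}
```

Under this reading, an archive that has been bundled whole inside a larger one has an inclusion index of 1.0 against it even though the similarity index is low.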
Maven 2
Implementation
  • Created byte code (bcel5) and source code signature extractors
  • Used SHA1 hash for class signatures to improve performance
      • We don’t care about near misses at the method or class level!
  • Built corpus from the Maven2 jar repository
      • Maven is unversioned + volatile!
      • 150 GB of jars, zips, tarballs, etc.
      • 130,000 binary jars (75,000 unique)
      • 26M .class files, 4M .java source files (incl. duplicates)
      • Archives contain archives: 75,000 classes are nested 4 levels deep!
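The hashing step might look roughly like this: each class signature is reduced to a fixed-length SHA-1 digest, so comparing two classes becomes a cheap string equality test and near misses are deliberately ignored. The class name `SignatureHash`, the UTF-8 encoding and the hex rendering are assumptions for this sketch. Only the use of SHA1 comes from the slide.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: hashes a class signature string with SHA-1.
class SignatureHash {

    // Returns the SHA-1 digest of the signature text as 40 lowercase hex characters.
    static String sha1Hex(String signature) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(signature.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        String sig = "public class a.b.C extends Object implements I\npublic C()";
        System.out.println(sha1Hex(sig));
    }
}
```

Any change to a single method signature yields a different digest, which matches the slide's point that only exact class-level matches count.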
An example
  • Looking for cewolf-1.0 in Maven2
Investigation
  • Target system: An industrial e-commerce app containing 84 jars
  • RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
  • RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
Investigation
  • RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?
  • For 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with a similarity index of 1.0
      • 48 exact, 3 correct product (target version not in Maven)
  • For 20 / 84, we found multiple matches with simIndex = 1.0
      • 19 exact, 1 correct product
  • For 12 / 84, we found matches with 0 < simIndex < 1.0
      • 1 exact, 9 correct product, 2 incorrect product
  • For 1 / 84, we found no match (product not in Maven as a binary)
  • More data here: http://juliusdavies.ca/uvic/jarchive/
Further uses
  • Used the version info extracted from the e-commerce app to perform audits for licensing and security
      • One jar changed open source licenses (GNU Affero, LGPL)
      • One jar version was found to have known security bugs
  • When did Google Android developers copy-paste httpclient.jar classes into android.jar? And how much work would it be to include a newer version?
      • We narrowed it down to two likely candidates, one of which turned out to be correct.
Summary
  • Who are you?
      • Determining the provenance of software entities is a growing and important problem
  • Software Bertillonage:
      • Quick and dirty techniques applied widely, then expensive techniques applied narrowly
  • Identifying version IDs of included Java libraries is an example of the software provenance problem
      • And our solution of anchored signature matching is an example of software Bertillonage
Production babies (born the week the paper was due)
  • Uncle Mike and Lilia Biswas (who was born in Mike’s basement)
  • Dad Julius and Naoki
Software Bertillonage: Finding the Provenance of an Entity

  • 1. Software Bertillonage: Finding the Provenance of an Entity Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle
  • 2.
  • 3. “Provenance” Originally, used for art / antiques, but now used in science and IT: Data provenance / audit trails Component provenance for security … but what about source code artifacts? A set of documentary evidence pertaining to the origin, history, or ownership of an artifact. [From “provenir”, French for “to come from”]
  • 4. Who are you? Alphonse Bertillon(1853-1914)
  • 5. Bertillonage metrics Height Stretch: Length of body from left shoulder to right middle finger when arm is raised Bust: Length of torso from head to seat, taken when seated Length of head: Crown to forehead Width of head: Temple to temple Length of right ear Length of left foot Length of left middle finger Length of left cubit: Elbow to tip of middle finger Width of cheeks
  • 6. Forensic Bertillonage “Quick and dirty”, and a giant step forward Can narrow huge set of candidate mugshots down to a small handful! Some problems: Equipment + training required Very sensitive to measurement error The metrics are not independent! … and fingerprints are much more accurate
  • 7. Software Bertillonage We want quick & dirty ways investigating the provenance of a function (file, library, binary, etc.) Who are you, really? Where did you come from? Does your mother know you’re here? It’s not fingerprinting or DNA analysis! You may be looking for a cousin or ancestor A good software Bertillonage metric should: be computationally inexpensive be applicable to the desired level of granularity / prog. language catch most of the bad guys (recall) significantly reduce the search space (precision)
  • 10. A real-world problem Software packages often bundle in third-party libraries to avoid “DLL-hell”[Di Penta-10] In Java world, jars may include library source code or just byte code Included libs may include other libs too! Payment Card Industry Data Security Std (PCI-DSS), Req #6: “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.” What if a financial software package doesn’t explicitly list the version IDs of its included libraries?
  • 11. Identifying included libraries The version ID may be embedded in the name of the component! e.g., commons-codec-1.1.jar … but often the version info is simply not there, or it’s wrong, …. Use fully qualified name (FQN) of each class plus a code search engine [Di Penta 10] Will return only product, not version Compare against all known compiled binaries But compilers, build-time compilation options may differ
  • 12. commons-codec-1.1.jar public class org.apache.commons.codec.binary.Base64 implements BinaryEncoder, BinaryDecoder public <init>() private static boolean isBase64(byte) public static boolean isArrayByteBase64(byte[]) public static byte[] encodeBase64(byte[]) public static byte[] encodeBase64Chunked(byte[]) public Object decode(Object) throws DecoderException public byte[] decode(byte[]) throws DecoderException public static byte[] encodeBase64(byte[],boolean) public static byte[] decodeBase64(byte[]) static byte[] discardWhitespace(byte[]) public Object encode(Object) throws EncoderException public byte[] encode(byte[]) throws EncoderException
  • 13. commons-codec-1.2.jar public class org.apache.commons.codec.binary.Base64 implements BinaryEncoder, BinaryDecoder public <init>() private static boolean isBase64(byte) public static boolean isArrayByteBase64(byte[]) public static byte[] encodeBase64(byte[]) public static byte[] encodeBase64Chunked(byte[]) public Object decode(Object) throws DecoderException public byte[] decode(byte[]) public static byte[] encodeBase64(byte[],boolean) public static byte[] decodeBase64(byte[]) static byte[] discardWhitespace(byte[]) static byte[] discardNonBase64(byte[]) public Object encode(Object) throws EncoderException public byte[] encode(byte[])
  • 14. Anchored class signatures Idea: Compile / acquire all known lib versions but extract only the signatures, then compare against target binary Shouldn’t vary by compiler/build settings For a class C with methods M1, …, Mn, we define its anchored class signature as: θ(C) = ⟨σ(C), ⟨σ(M1), ..., σ(Mn)⟩⟩ For an archive A composed of classes C1, …, Ck, we define its anchored class signature as: θ(A) = {θ(C1), ..., θ(Ck)}
  • 15. // This is **decompiled** source!! package a.b; public class C extends java.lang.Object implements g.h.I { public C() { // default constructor is inserted by javac } synchronized static int a (java.lang.String s) throws a.b.E { // decompiled byte code omitted } } σ(C) = public class a.b.C extends Object implements I σ(M1 ) = public C() σ(M2 ) = default synchronized static int a(String) throws E θ(C) = ⟨σ(C), ⟨σ(M1 ), σ(M2 )⟩⟩
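A rough sketch of how σ(C), σ(M) and θ(C) could be assembled is shown here. It uses `java.lang.reflect` on loaded classes, which is a swapped-in technique: the slides describe extractors that read byte code and source code instead. The names `AnchoredSignatureSketch` and `SampleC` are hypothetical, the modifier order follows `Modifier.toString` rather than the slide, sorting the member signatures is an assumption made to get a stable order, and only plain classes (not interfaces or enums) are handled.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; the original extractors work on byte code, not reflection.
final class AnchoredSignatureSketch {

    /** theta(C): the class signature followed by the sorted member signatures. */
    static List<String> theta(Class<?> c) {
        List<String> members = new ArrayList<>();
        for (Constructor<?> k : c.getDeclaredConstructors()) {
            members.add(member(k.getModifiers() & Modifier.constructorModifiers(),
                    c.getSimpleName(), k.getParameterTypes(), k.getExceptionTypes()));
        }
        for (Method m : c.getDeclaredMethods()) {
            if (m.isSynthetic()) continue; // skip compiler-generated methods
            members.add(member(m.getModifiers() & Modifier.methodModifiers(),
                    m.getReturnType().getSimpleName() + " " + m.getName(),
                    m.getParameterTypes(), m.getExceptionTypes()));
        }
        Collections.sort(members); // reflection order is unspecified, so impose one

        List<String> theta = new ArrayList<>();
        theta.add(sigmaClass(c));
        theta.addAll(members);
        return theta;
    }

    /** sigma(C): modifiers, fully qualified name, superclass and interfaces. */
    static String sigmaClass(Class<?> c) {
        StringBuilder sb = new StringBuilder(modifiers(c.getModifiers() & Modifier.classModifiers()));
        sb.append("class ").append(c.getName());
        if (c.getSuperclass() != null) {
            sb.append(" extends ").append(c.getSuperclass().getSimpleName());
        }
        sb.append(names(" implements ", c.getInterfaces()));
        return sb.toString();
    }

    private static String member(int mods, String name, Class<?>[] params, Class<?>[] thrown) {
        return modifiers(mods) + name + "(" + join(params) + ")" + names(" throws ", thrown);
    }

    // Package-private visibility is spelled "default", as on the slide.
    private static String modifiers(int mods) {
        boolean hasAccess = (mods & (Modifier.PUBLIC | Modifier.PROTECTED | Modifier.PRIVATE)) != 0;
        String text = Modifier.toString(mods);
        String prefix = hasAccess ? "" : "default ";
        return text.isEmpty() ? prefix : prefix + text + " ";
    }

    private static String names(String keyword, Class<?>[] types) {
        return types.length == 0 ? "" : keyword + join(types);
    }

    private static String join(Class<?>[] types) {
        StringBuilder sb = new StringBuilder();
        for (Class<?> t : types) {
            if (sb.length() > 0) sb.append(",");
            sb.append(t.getSimpleName());
        }
        return sb.toString();
    }
}

/** Hypothetical class loosely mirroring the slide's example. */
class SampleC implements Runnable {
    public SampleC() { }
    public void run() { }
    static synchronized int a(String s) throws java.io.IOException { return s.length(); }
}
```

Calling `AnchoredSignatureSketch.theta(SampleC.class)` returns the class signature first and then one line per constructor and method. Because only declared names, types and modifiers are used, the result should not depend on the compiler or build settings, which is the property the slide relies on.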
  • 16. Archive similarity We define the similarity index of two archives as their Jaccard coefficient: simIndex(A, B) = |θ(A) ∩ θ(B)| / |θ(A) ∪ θ(B)| We define the inclusion index of two archives as:
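The two indices can be sketched over plain sets of class signatures. The Jaccard coefficient is the standard |A ∩ B| / |A ∪ B|. The formula for the inclusion index did not survive in this transcript, so the form used here (the fraction of A's signatures that also occur in B) is an assumption. The class name `ArchiveSimilarity` and the choice to return 0 for empty archives are also assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two archive-level indices.
final class ArchiveSimilarity {

    /** Jaccard coefficient: |A intersect B| / |A union B|; 0 when both sets are empty. */
    static <T> double similarity(Set<T> a, Set<T> b) {
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersectionSize(a, b) / union.size();
    }

    /** Assumed inclusion index: fraction of A's signatures that also occur in B. */
    static <T> double inclusion(Set<T> a, Set<T> b) {
        return a.isEmpty() ? 0.0 : (double) intersectionSize(a, b) / a.size();
    }

    private static <T> int intersectionSize(Set<T> a, Set<T> b) {
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }
}
```

Under this reading, a source archive that contains every shipped class plus extra test classes would score below 1.0 on similarity but still 1.0 on inclusion of the binary archive in the source archive.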
  • 18. Implementation Created byte code (bcel5) and source code signature extractors Used SHA1 hash for class signatures to improve performance We don’t care about near misses at the method or class level! Built corpus from Maven2 jar repository Maven is unversioned + volatile! 150 GB of jars, zips, tarballs, etc., 130,000 binary jars (75,000 unique) 26M .class files, 4M .java source files (incl. duplicates) Archives contain archives: 75,000 classes are nested 4 levels deep!
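Hashing each class signature with SHA-1 might look like the sketch below. The class name `SignatureHash`, the newline separator between the class signature and the method signatures, and the hex encoding are assumptions; the slide only states that SHA1 hashes of class signatures were used. Because the whole class collapses to one hash, two classes either match exactly or not at all, which fits the remark that near misses at the method or class level are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch; the serialization format is an assumption.
final class SignatureHash {

    /** Returns the SHA-1 digest, as 40 hex characters, of a class signature and its method signatures. */
    static String sha1Hex(String classSignature, List<String> methodSignatures) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(classSignature.getBytes(StandardCharsets.UTF_8));
            for (String m : methodSignatures) {
                md.update((byte) '\n'); // assumed separator between signatures
                md.update(m.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Comparing archives then reduces to comparing sets of fixed-length strings rather than full signature text.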
  • 19. An example Looking for cewolf-1.0 in Maven2
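One way such a lookup could be organized is an inverted index from class-signature hash to the corpus archives that contain that class, so that only archives sharing at least one class with the target are considered. This is an assumed design, not one described in the slides; `SignatureIndex` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index over a corpus of archives.
final class SignatureIndex {
    // class-signature hash -> names of corpus archives containing that class
    private final Map<String, List<String>> archivesByHash = new HashMap<>();

    /** Registers one corpus archive under each of its class-signature hashes. */
    void add(String archiveName, Set<String> classHashes) {
        for (String h : classHashes) {
            archivesByHash.computeIfAbsent(h, k -> new ArrayList<>()).add(archiveName);
        }
    }

    /** Counts, per corpus archive, how many of the target's class hashes it shares. */
    Map<String, Integer> sharedCounts(Set<String> targetHashes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String h : targetHashes) {
            for (String archive : archivesByHash.getOrDefault(h, Collections.<String>emptyList())) {
                counts.merge(archive, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The shared counts give the intersection sizes needed for the similarity index, so the candidates can then be ranked without scanning archives that have nothing in common with the target.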
  • 20. Investigation Target system: An industrial e-commerce app containing 84 jars RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive? RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?
  • 21. Investigation RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive? For 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with similarity index of 1.0. 48 exact, 3 correct product (target version not in Maven) For 20 / 84, we found multiple matches with simIndex = 1.0 19 exact, 1 correct product For 12 / 84, we found matches with 0 < simIndex < 1.0 1 exact, 9 correct product, 2 incorrect product For 1 / 84, we found no match (product not in Maven as a binary) More data here: http://juliusdavies.ca/uvic/jarchive/
  • 22. Further uses Used the version info extracted from the e-commerce app to perform audits for licensing and security One jar changed open source licenses (GNU Affero, LGPL) One jar version was found to have known security bugs When did Google Android developers copy-paste httpclient.jar classes into android.jar? And how much work would it be to include a newer version? We narrowed it down to two likely candidates, one of which turned out to be correct.
  • 23. Summary Who are you? Determining the provenance of software entities is a growing and important problem Software Bertillonage: Quick and dirty techniques applied widely, then expensive techniques applied narrowly Identifying version IDs of included Java libraries is an example of the software provenance problem And our solution of anchored signature matching is an example of software Bertillonage
  • 24. Production babies (born the week the paper was due) Uncle Mike and Lilia Biswas (who was born in Mike’s basement) Dad Julius and Naoki
  • 25. Software Bertillonage: Finding the Provenance of an Entity Julius Davies, Daniel M. German, Michael W. Godfrey, Abram Hindle
  • 26. Investigation RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive? For 22 / 84 source jars (26.2%), we found a single candidate from the corpus with similarity index of 1.0. 13 exact, 2 correct product For 7 / 84, we found multiple matches with simIndex = 1.0 6 exact, 1 correct product For 46 / 84, we found matches with 0 < simIndex < 1.0 25 exact, 20 correct product, 1 incorrect product For 16 / 84, we found no match (product not in Maven as source) More data here: http://juliusdavies.ca/uvic/jarchive/
  • 27. Investigation RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive? Found exact match or correct product 81 / 84 times (96.4%) RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive? Found exact match or correct product 57 / 84 times (67.9%)
  • 28. Summary: Anchored signature matching If the master repository is rich enough, anchored signature matching can be highly accurate for both source and binary Though you might have to examine multiple candidates with perfect scores (median: 4, max: 30) Also works well if the product is present but the target version is missing The approach is fast and simple, once the repository has been built
  • 30. Forensic Bertillonage Some problems … Equipment was cumbersome, expensive, required training Measurement error, consistency The metrics were not independent! Adoption (and later abandonment) … but overall it was a big success! Quick and dirty, and a huge leap forward Some training and tools required but could be performed with technology of late 1800s If done accurately, could quickly narrow down a very large pool of mugshots to only a handful
  • 31. Bertillonage desiderata A good Bertillonage metric should: be computationally inexpensive be applicable to the desired level of granularity / prog. language catch most of the bad guys (recall) significantly reduce the search space (precision) Why not “fingerprinting” or “DNA analysis”? Often there just is not enough info (or too much noise) to make conclusive identification So we hope to reduce the candidate set so that manual examination is feasible
  • 32. Software Bertillonage We want quick & dirty ways investigating the provenance of a function (file, library, binary, etc.) Who are you, really? Entity and relationship analysis Where did you come from? Evolutionary history Does your mother know you’re here? Licensing

Editor's Notes

  1. Diana and Actaeon by Titian has a full provenance covering its passage through several owners and four countries since it was painted for Philip II of Spain in the 1550s. Currently, it sits in the National Gallery, London (when not on tour).
  2. Invented a biometrics-based system for indexing criminal records. Also invented the mug shot, crime scene photography, forensic document analysis …
  3. You would think that the similarity index would be enough … but a lot of source archives contain test classes that don’t ship with the final product
  4. Maven2 repository acts as the Java community’s de facto library archive. The repository was originally developed as a place from where the Maven build system could download required libraries to build and compile an application. Thanks to its broad coverage and depth, many competing java build systems and dependency resolvers currently make use of it.
  5. For 51: 48 were “perfect” (correct product and version), 3 were correct product, but Maven did not have that version. For 20: 19 contained a perfect match in the result set; for 1, Maven did not have that version. The perfect match set was small for all but 2 or 3, where a match set of over 30 candidates was produced.
  6. For RQ1, when similarity index = 1.0, median number of cases to check was 4, max was 30
  7. Bertillon noticed that while everyone has eyes, nose, mouth, ears, etc, there are many variations on shapes and sizes. Perhaps this could be used as an organizing principle for mugshots. He also invented the modern mugshot, realizing that the combination of head-on and profile, done systematically, could aid in creating a useful photo repository.