How Component ID Methods Impact Open Source Risk Management Efficiency

WHITEPAPER
Snippets, Scans and Snap Decisions: How Component Identification Methods Impact the Efficiency of
Open Source Risk Management

Background
For the most part, modern software is assembled, not written. Approximately 80-90% of a
typical software application consists of third-party components, most of which are open
source. Custom business logic makes up the remaining 10-20%.
This massive reliance on open source components has created a “software supply chain” and
new challenges for managing software security, quality and intellectual property. Organizations
that rely on custom software are increasingly seeking visibility and control to manage
risk and maximize benefit.
But to properly manage open source components, you must know as much as possible about
them—starting with precisely identifying them. Security, quality and licensing information
is of little use if you haven’t precisely identified the component you are using. And, without
both accurate and actionable component information, developers are not able to make the
right component selection from the start.
Component identification is far from straightforward. There are numerous approaches used
to identify binary components. These include:
• Source Code Scanning
• Simple CPE Name Matching
• Simple Binary Matching
• Advanced Binary Fingerprinting

Source Code Scanning
Source code scanning technologies examine custom code and flag
potential matches against fingerprints of known open source code.
Source code scanning is a thorough and effective mechanism for
surfacing potential ‘snippets’ of open source code that might have
been inserted into otherwise proprietary work.
Source code scanning is, however, ineffective at precisely identifying
the binary component with which that code is associated. This
is because of the very nature of the component ecosystem.
Components are enormously complex; each one is made up of hundreds
of sub-assemblies (e.g. class files). Class files are commonly
shared among components. Of the nearly 200 million class files in
the Central Repository, there are fewer than 10 million unique class
files being combined in myriad ways.
Because of this highly commingled nature of the component
ecosystem, source code scanners are unable to precisely match
a ‘code snippet’ to a single binary component (and version). The
code scanner will often find dozens to hundreds of potential
matches, rendering the analysis of any security, quality or licensing
information futile.
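
The ambiguity described above can be illustrated with a small sketch. It is not how any particular scanner works; the whitespace-normalized SHA-256 fingerprint, the `CATALOG` contents and the `com.example` coordinates are all hypothetical stand-ins. It only shows why a snippet that lives in a widely shared class cannot be pinned to one component and version.

```python
import hashlib
from collections import defaultdict


def fingerprint(source: str) -> str:
    """Hash whitespace-normalized source text (a stand-in for a real snippet fingerprint)."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical catalog: component coordinates -> source of the classes each one bundles.
SHARED_UTIL = "class StringUtil { static boolean isEmpty(String s) { return s == null; } }"
CATALOG = {
    "com.example:lib-a:1.0": [SHARED_UTIL, "class A {}"],
    "com.example:lib-a:1.1": [SHARED_UTIL, "class A { int v; }"],
    "com.example:lib-b:2.3": [SHARED_UTIL, "class B {}"],
}


def build_index(catalog):
    """Invert the catalog: fingerprint -> set of components that contain it."""
    index = defaultdict(set)
    for component, sources in catalog.items():
        for src in sources:
            index[fingerprint(src)].add(component)
    return index


INDEX = build_index(CATALOG)


def candidates(snippet: str) -> list:
    """Return every component whose contents match the snippet's fingerprint."""
    return sorted(INDEX.get(fingerprint(snippet), set()))
```

Calling `candidates(SHARED_UTIL)` returns all three catalog entries, while a class unique to one component returns a single entry. With millions of shared class files, the first case is the common one.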
Users of these technologies frequently report that they “receive
too many false positives” or that the scanners require “too much
manual research.” What they are really observing is that the
technology itself is not well suited to the task of binary matching.

Simple CPE Name Matching
This method takes the product and version recorded for a CVE in the
Common Platform Enumeration (CPE) data of the National Vulnerability
Database (NVD) and uses them to look up component Group ID and
version information. The CPE is intended to identify vulnerable
products, not vulnerable artifacts. Using the CPE product and version
to fuzzy match Group:Artifact:Version therefore results in
over-identification, or false positives, on the vulnerable artifact
name. This approach works well on applications that have an ideal set
of vulnerable components whose names align precisely with the CPE
data, or when there is only one matching artifact for Group:*:Version.
For Example:
CVE-2014-0114
CPE: cpe:/a:apache:struts:1.3.8
Vulnerable GAV: org.apache.struts:struts-core:1.3.8
CPE Name Matching:
org.apache.struts:struts-core:1.3.8 (True Positive)
org.apache.struts:struts-taglib:1.3.8 (False Positive)
org.apache.struts:struts-tiles:1.3.8 (False Positive)
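
The over-identification in this example can be reproduced with a minimal sketch. The matching rule below (vendor and product must both appear as tokens of the Group ID, and the versions must be equal) is an assumed, simplified reading of "Group:*:Version" matching, and `parse_cpe`, `fuzzy_match` and the `APPLICATION` list are hypothetical names introduced for illustration only.

```python
def parse_cpe(cpe: str):
    """Split a CPE 2.2 URI such as cpe:/a:apache:struts:1.3.8 into vendor, product, version."""
    parts = cpe.split(":")
    # parts == ["cpe", "/a", vendor, product, version, ...]
    vendor, product, version = parts[2], parts[3], parts[4]
    return vendor, product, version


def fuzzy_match(cpe: str, gavs):
    """Return every GAV whose group mentions the vendor and product and whose version matches.

    The artifact name is ignored, which mirrors matching on Group:*:Version.
    """
    vendor, product, version = parse_cpe(cpe)
    hits = []
    for gav in gavs:
        group, _artifact, gav_version = gav.split(":")
        tokens = group.split(".")
        if vendor in tokens and product in tokens and gav_version == version:
            hits.append(gav)
    return hits


# Hypothetical application inventory.
APPLICATION = [
    "org.apache.struts:struts-core:1.3.8",
    "org.apache.struts:struts-taglib:1.3.8",
    "org.apache.struts:struts-tiles:1.3.8",
    "org.apache.struts:struts-core:1.3.10",
    "org.apache.commons:commons-lang3:3.1",
]

# Curated knowledge of which artifact actually carries the flaw (taken from the example above).
VULNERABLE = {"org.apache.struts:struts-core:1.3.8"}

matches = fuzzy_match("cpe:/a:apache:struts:1.3.8", APPLICATION)
true_positives = [g for g in matches if g in VULNERABLE]
false_positives = [g for g in matches if g not in VULNERABLE]
```

Here `matches` contains three artifacts, of which only `struts-core` is a true positive. Separating the two lists requires the curated `VULNERABLE` set, which the CPE data alone does not supply.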
Without proper data curation, the effectiveness of this approach is
limited by the precision of the CPE classifications in the CVE data. CVE
data often does not identify the root cause (the vulnerable artifact) of
the issue but instead identifies the platform that uses the vulnerable
component. Without research into each CVE to identify the vulnerable
artifact, this method of matching will generate a high percentage
of false negatives. Also, when the component resides in multiple
“authentic” repositories, you will get a false negative unless you
have all of the hashes from each repository.

Simple Binary Matching
Simple binary matching is used to match binaries based on cryptographic
hashes of inspected components against a known library
of component hashes. This allows precise identification of binary
components provided they have not been modified in any way.
Unfortunately, many users modify components in some way that
alters the identifying signatures (e.g. repackaging, rebuilding,
combining components, removing unused classes, etc.). When
binaries are altered, simple binary matching fails and components
are not properly identified.
As with Simple CPE Name Matching, the effectiveness of this approach
is limited by the precision of the CPE classifications in the CVE
data. CVE data often does not identify the root cause (the vulnerable
artifact) of the issue and requires further manual research to be useful.
Users of these technologies often report “false negatives” or “misses.”
Also, when the component resides in multiple “authentic” repositories,
you will get a false negative unless you have all of the hashes
from each repository.
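
The hash lookup described in this section, and both ways it misses, can be sketched in a few lines. The byte strings stand in for real archive contents, SHA-1 is an assumed choice of hash, and `KNOWN_HASHES` and `identify` are hypothetical names used only for this illustration.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Cryptographic hash of the full binary, as used for exact matching."""
    return hashlib.sha1(data).hexdigest()


# Placeholder bytes standing in for the contents of a published archive.
original = b"bytes of struts-core-1.3.8.jar as published in repository one"

# Known library of component hashes, built from a single repository.
KNOWN_HASHES = {sha1_of(original): "org.apache.struts:struts-core:1.3.8"}


def identify(data: bytes):
    """Return the component coordinates for an exact hash match, else None."""
    return KNOWN_HASHES.get(sha1_of(data))


# An unmodified binary is identified precisely.
exact = identify(original)

# Any alteration (repackaging, rebuilding, stripping classes) changes the hash.
repackaged = original + b" with one unused class removed"
miss_after_change = identify(repackaged)

# The same release published with different bytes elsewhere also misses
# until that repository's hash is added to the library.
other_repo_copy = b"bytes of struts-core-1.3.8.jar as published in repository two"
miss_other_repo = identify(other_repo_copy)
KNOWN_HASHES[sha1_of(other_repo_copy)] = "org.apache.struts:struts-core:1.3.8"
hit_other_repo = identify(other_repo_copy)
```

Both misses are false negatives in the sense used above: the component is present, but the lookup reports nothing.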
Advanced Binary Fingerprinting
To provide precise binary matching without the false positives of
source code scanning or the false negatives of simple binary matching,
Sonatype has invented a new, patent-pending method called
Advanced Binary Fingerprinting. With this method, Sonatype is able
to identify the combinations of subcomponents that are unique to
a specific version of a given component.
Advanced Binary Fingerprinting is able to precisely identify components
even when they have been repackaged, rebuilt, or otherwise
altered. This method allows proper assignment of security, quality
and licensing data to a specific component and version. Advanced
binary matching is both fast and precise. A large application can be
analyzed and a precise bill of materials delivered in minutes.
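
The paper does not describe how Advanced Binary Fingerprinting works internally, so the following is not Sonatype's method. It is a generic sketch of the underlying idea of matching on combinations of subcomponents: each archive is reduced to the set of hashes of its class files, and an inspected archive is assigned to the catalog entry with the highest set overlap (Jaccard similarity, an assumed choice). The catalog contents, the `com.example` coordinates and the 0.5 threshold are hypothetical.

```python
import hashlib


def class_hashes(classes: dict) -> frozenset:
    """Reduce an archive (file name -> bytes) to the set of hashes of its class files."""
    return frozenset(hashlib.sha256(body).hexdigest() for body in classes.values())


# Hypothetical catalog of known component versions.
CATALOG = {
    "com.example:widget:1.0": class_hashes(
        {"A.class": b"a1", "B.class": b"b1", "Util.class": b"u"}
    ),
    "com.example:widget:1.1": class_hashes(
        {"A.class": b"a2", "B.class": b"b1", "Util.class": b"u"}
    ),
    "com.example:gadget:3.0": class_hashes({"G.class": b"g", "Util.class": b"u"}),
}


def best_match(classes: dict, threshold: float = 0.5):
    """Return (component, score) for the closest catalog entry, or (None, score) below threshold."""
    observed = class_hashes(classes)
    best, best_score = None, 0.0
    for component, known in CATALOG.items():
        # Jaccard similarity: shared class hashes divided by all class hashes seen.
        score = len(observed & known) / len(observed | known)
        if score > best_score:
            best, best_score = component, score
    if best_score < threshold:
        return None, best_score
    return best, best_score


# widget 1.1 with Util.class stripped out: a whole-file hash would no longer match,
# but the remaining combination of classes still points to one version.
stripped = {"A.class": b"a2", "B.class": b"b1"}
component, score = best_match(stripped)
```

In this toy catalog the stripped archive still resolves to `com.example:widget:1.1`, because the pair of class hashes it retains occurs together only in that version. A production system would need far more than this, for example handling of recompiled classes whose bytes differ.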

Types of Component Matching

Source Code Scanning
Description: This technology examines custom code and flags potential matches against
fingerprints of known open source code.
Pros: Thorough and effective method for identifying potential “snippets” of open source code
that may have been inserted in otherwise proprietary work.
Cons: Not effective at precisely matching a code “snippet” with a single binary component and
version. A source code scanner may find dozens or hundreds of potential matches requiring
additional manual research.

Simple CPE Name Matching
Description: This method uses the Common Platform Enumeration (CPE) data from the National
Vulnerability Database (NVD) to identify the component Group ID and version information
associated with a CVE. The CPE is intended to identify vulnerable products and not
vulnerable artifacts.
Pros: This approach works well on applications that have an ideal set of vulnerable components
whose names align precisely with the CPE data or when there is only one matching artifact for
Group:*:Version.
Cons: The effectiveness of this approach is limited by the precision of the CPE classifications
in the CVE data. CVE data often does not identify the root cause (the vulnerable artifact) of
the issue but identifies the platform that uses the vulnerable component. Without research into
each CVE to identify the vulnerable artifact, this method of matching will generate a high
percentage of false negatives.

Simple Binary Matching
Description: Simple binary matching is used to match binaries based on cryptographic hashes of
inspected components against a known library of component hashes.
Pros: Allows precise identification of binary components as long as they have not been modified
in any way.
Cons: Many users modify components in some way that alters the identifying signatures (e.g.
repackaging, rebuilding, combining components, removing unused classes, etc.). When binaries
are altered, simple binary matching fails and components are not properly identified. Users
report false negatives or misses.

Advanced Binary Fingerprinting
Description: Patent-pending method of identifying the combinations of subcomponents that are
unique to a specific component version, as well as other component dependencies.
Pros: Precisely identifies a component version with known security or license vulnerabilities
regardless of whether it has been modified. Eliminates false negatives and false positives to
produce accurate analyses very quickly.
Cons: Java and NuGet components currently offered, but quickly expanding to include JavaScript
and npm.

Sonatype Inc. • 8161 Maple Lawn Drive, Suite 250 • Fulton, MD 20759 • 1.877.866.2836 • www.sonatype.com
2015. Sonatype Inc. All Rights Reserved.
Sonatype helps organizations build better software, even faster. Like a traditional supply chain, software applications
are built by assembling open source and third-party components streaming in from a wide variety of public and internal
sources. While reuse is far faster than writing custom code, the flow of components into and through an organization remains
complex and inefficient. Sonatype’s Nexus platform applies proven supply chain principles to increase speed, efficiency
and quality by optimizing the component supply chain. Sonatype has been at the forefront of creating tools to
improve developer efficiency and quality since the inception of the Central Repository and Apache Maven in 2001, and
the company continues to serve as the steward of the Central Repository, serving 17.2 billion component download
requests in 2014 alone. Sonatype is privately held with investments from New Enterprise Associates (NEA), Accel Partners,
Bay Partners, Hummer Winblad Venture Partners and Morgenthaler Ventures.
For more information about Sonatype, visit www.sonatype.com
The Sonatype Advantage
Sonatype brings practical intelligence to component-based
software development. Sonatype offers
distinct advantages to organizations that need to
improve visibility and control over the open source
components that they use in software development.
• Sonatype pioneered component-based software
development as the creators of the
Apache Maven build system and the Nexus
repository manager.
• Sonatype is also the steward of the (Maven)
Central Repository, the industry’s primary
source for open source components, containing
nearly 700,000 components and serving 17
billion requests last year.
• Sonatype is the only vendor to deliver actionable
component security, quality, and licensing
information directly into the tools developers
use every day.
• Sonatype is the only vendor offering patent-pending
Advanced Binary Matching to quickly
and precisely identify components, even if they
have been altered or repackaged. This unique
technology minimizes “false positives” and
eliminates time-consuming manual research.
• Sonatype is the only vendor to provide real-time
update notifications to alert users when a
component they are using has been updated or
changed.
• Sonatype is the only vendor to analyze all
component dependencies, enabling quick
and precise identification of potential issues,
even if they’re nested deep within a complex
dependency tree.
