2. ▷ Introduction to ScanCode
○ Toolkit
○ App
▷ Demo
▷ More Details
▷ About nexB
3. Benefits of an open source scanner
As a developer:
▷ I get normalized, comprehensive data on origin and license
▷ I can find the license immediately when I evaluate a library
▷ I can identify and resolve license issues before a release
▷ I can identify issues for each commit
▷ I can communicate clearly with legal and business about license
and origin of third-party code
You can use the Apache-licensed ScanCode Toolkit now!
Participate by contributing code, license rules, bug reports or suggestions.
4. What does ScanCode Toolkit do?
It scans source and binary code to find:
▷ License notices, texts and “mentions”
▷ Copyright notices
▷ Package-level information (RPM, nuget, NPM, Jar, etc.)
▷ Other provenance clues (author, email, etc.)
▷ File-level information (type, name, checksums, etc.)
5. ScanCode results are provided as:
▷ JSON file
▷ Dynamic HTML
▷ Static HTML table usable in a
spreadsheet
▷ AND
▷ ... the new ScanCode App
▷ ... next, in the ScanCode.io server
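As a rough illustration of "using the JSON file" in your own tooling, the sketch below counts how many files mention each license. The layout used here (`files`, `path`, `licenses`, `key`) is an assumption made for this example; the real output has more fields and its layout may differ between ScanCode versions, so check an actual scan file before relying on any field name.

```python
import json
from collections import Counter

# Hypothetical, simplified scan result embedded as a string for the example.
SAMPLE_SCAN = """
{
  "files": [
    {"path": "src/a.c", "licenses": [{"key": "mit"}]},
    {"path": "src/b.c", "licenses": [{"key": "mit"}, {"key": "gpl-2.0"}]},
    {"path": "README", "licenses": []}
  ]
}
"""

def count_license_keys(scan_text):
    """Count how many files mention each license key."""
    scan = json.loads(scan_text)
    counts = Counter()
    for file_entry in scan.get("files", []):
        # Use a set so a license detected twice in one file counts once.
        keys = {lic["key"] for lic in file_entry.get("licenses", [])}
        counts.update(keys)
    return counts

license_counts = count_license_keys(SAMPLE_SCAN)
```

The same pattern (load, walk the file entries, aggregate) applies to copyrights or package data once you know the field names in your scan file.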
7. Available on GitHub
▷ Get the code
https://github.com/nexB/scancode-toolkit/
▷ Read more
https://github.com/nexB/scancode-toolkit/wiki
▷ Report an issue or idea
https://github.com/nexB/scancode-toolkit/issues
▷ Commercial support and services available
from nexB: ScanCode starter pack
http://www.nexb.com/
8. ScanCode Licensing
Component                License                   Notes
Software                 Apache 2.0                With an acknowledgement in the scan output
Reference data           CC0 1.0 Public Domain
Third-party components   L/GPL, MIT, BSD, Apache   Various licenses
11. ScanCode App
Motivation:
▷ Analyze ScanCode results
▷ Document your conclusion about the
provenance and license for a software
component.
▷ Save conclusions
▷ Share results
14. Summary of Features
▷ View results in tree or tabular view
▷ Add conclusion data at any node of the
existing codebase hierarchy
▷ Save Components and conclusions to a
JSON file
16. Credits
Special thanks to all the people who made and released these
awesome free resources:
▷ Presentation template by SlidesCarnival
▷ Photographs by Unsplash
▷ And all the software authors who made ScanCode possible
17. About nexB Inc.
We offer:
▷ DejaCode™ - Open Data Platform for Managing
Open Source - http://www.dejacode.com/
▷ Open Source Scanning & Tracking Tools -
https://github.com/nexB
▷ Open Source Software Expert Audit Services -
http://www.nexb.com/services.html
19. ScanCode by the numbers
▷ Over 6,000 tests
▷ Over 500 large software products scanned
▷ Over 3,000 licenses, notices and samples
20. ScanCode Toolkit - Technology
▷ Written primarily in Python
○ also JavaScript, Ruby, Java and C/C++
▷ Tested on Linux, OS X and Windows
▷ Command line tool or library
▷ Simple HTML browser-app (any modern
browser) - runs locally
21. ScanCode App - Technology
▷ Based on Electron and written primarily in
JavaScript
▷ D3.js used for data visualizations
22. What is Scanning?
Detect and discover “evidence” of origin and
license in code (source or binary files)
▷ Copyright notice
▷ License notice and/or license text
▷ Software package manifests
▷ Email, URL, author or other names
▷ Other origin and license clues found in the
code
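To make the idea of collecting "evidence" concrete, here is a minimal sketch that pulls a few kinds of clues out of a file's text with regular expressions. The patterns and the `scan_text` helper are assumptions made for this illustration and are far simpler than what a real scanner needs; this is not ScanCode's own detection code.

```python
import re

# Deliberately simple patterns; real-world notices come in many more forms.
COPYRIGHT_RE = re.compile(r"copyright\s+(?:\(c\)\s*)?\d{4}[^\n]*", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def scan_text(text):
    """Return the origin clues found in one file's text."""
    return {
        "copyrights": [m.group(0).strip() for m in COPYRIGHT_RE.finditer(text)],
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
    }

SAMPLE = """\
/* Copyright (c) 2016 Example Corp.
 * Contact: dev@example.com
 * See http://example.com/license for details.
 */
"""

clues = scan_text(SAMPLE)
```

Note that everything found here comes from the scanned file itself; nothing is compared against an external index of known code.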
23. Scanning is not Matching
Matching looks for similarities between your
code and an index (digital fingerprints) of OSS
code
▷ If your code is similar it “may” share a
similar origin
▷ Matching may be applied at multiple levels
○ Package
○ File or snippet
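For contrast with scanning, the sketch below shows matching in miniature: snippets of your code are fingerprinted and looked up in an index built from known OSS code. The hashing scheme, the three-line window and all names are assumptions chosen for illustration; production matchers use more robust fingerprints and much larger indexes.

```python
import hashlib

def fingerprint(text):
    """Hash the text after normalizing whitespace, so reindenting does not matter."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def snippet_fingerprints(text, window=3):
    """Fingerprint every run of `window` consecutive non-blank lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        fingerprint("\n".join(lines[i:i + window]))
        for i in range(len(lines) - window + 1)
    }

# Hypothetical index built from a known OSS file.
OSS_FILE = (
    "int add(int a, int b) {\n    return a + b;\n}\n"
    "int sub(int a, int b) {\n    return a - b;\n}\n"
)
INDEX = snippet_fingerprints(OSS_FILE)

# Your code: one function copied (and reindented), no notices kept.
MY_FILE = (
    "void main_loop(void) {\n}\n"
    "int add(int a, int b) {\n  return a + b;\n}\n"
)
shared = snippet_fingerprints(MY_FILE) & INDEX
```

A non-empty `shared` set only says the code is similar, which is why a match "may" indicate a shared origin and still needs review.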
24. Scanning plus Matching
▷ Scanning will identify origin and license in
most cases, but
○ Does not detect copying of snippets, or
○ Intentional stripping of notices, etc.
▷ Matching can identify code that was copied
and/or stripped, but
○ Typically produces MANY false
positives and requires extensive review
○ Especially for the most commonly used
OSS projects
25. How does ScanCode work? (1)
▷ Each file is categorized based on its type
▷ Archives and compressed files are fully extracted
▷ The text of each file is collected (source and binaries)
▷ Each file's text is then "scanned"
▷ Results are formatted and returned as a JSON file
▷ You can view the results in a browser, or
▷ Use the JSON file as you want
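The steps above can be sketched as a small pipeline. This is a toy version under stated assumptions, not ScanCode's implementation: only zip archives are extracted, file typing is a crude null-byte check, the "scan" is a single copyright regex, and every function name is hypothetical.

```python
import json
import os
import re
import tempfile
import zipfile

COPYRIGHT_RE = re.compile(r"copyright[^\n]*", re.IGNORECASE)

def categorize(path):
    """Very rough file typing: archive, binary or text."""
    if zipfile.is_zipfile(path):
        return "archive"
    with open(path, "rb") as f:
        chunk = f.read(1024)
    return "binary" if b"\0" in chunk else "text"

def collect_text(path):
    """Decode leniently so that strings embedded in binaries are kept too."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")

def scan_file(path):
    """Categorize one file and scan its text unless it is an archive."""
    kind = categorize(path)
    result = {"path": path, "type": kind, "copyrights": []}
    if kind != "archive":
        result["copyrights"] = COPYRIGHT_RE.findall(collect_text(path))
    return result

def scan_tree(root):
    """Scan every file under root, descending into extracted zip archives."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            results.append(scan_file(path))
            if results[-1]["type"] == "archive":
                # Extract to a temporary directory and scan its content too.
                with tempfile.TemporaryDirectory() as tmp:
                    with zipfile.ZipFile(path) as zf:
                        zf.extractall(tmp)
                    results.extend(scan_tree(tmp))
    return results

def to_json(results):
    """Format the results as a JSON document."""
    return json.dumps({"files": results}, indent=2)
```

Calling `to_json(scan_tree("some/codebase"))` on a directory of your choice returns a JSON string that can be saved to a file or post-processed.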
26. How does ScanCode work? (2)
▷ For licenses, the techniques are similar to DNA
analysis with multi-pattern matching
▷ Licenses are found exactly or approximately based on
a set of thousands of license texts, notices and
examples
▷ For copyrights, a syntax and grammar analyzer
captures the many forms of copyright statements
▷ Emails, URLs, authors, person names and other data
are captured using similar pattern matching
techniques
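To illustrate exact versus approximate license detection, the sketch below compares a text against a tiny reference set token by token. It swaps in Python's standard `difflib.SequenceMatcher` as the similarity measure, which is an assumption for this example and not the multi-pattern matching ScanCode itself uses; the three reference notices, the 0.8 threshold and the function names are also hypothetical.

```python
import re
from difflib import SequenceMatcher

# Tiny stand-in for a reference set of thousands of texts and notices.
REFERENCE_NOTICES = {
    "apache-2.0": "Licensed under the Apache License, Version 2.0",
    "mit": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy",
    "gpl-2.0": "This program is free software; you can redistribute it "
               "and/or modify it under the terms of the GNU General Public License",
}

def tokens(text):
    """Lowercase word tokens; punctuation other than dots is dropped."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def best_license_match(text, threshold=0.8):
    """Return (license key or None, similarity score of the best candidate)."""
    query = tokens(text)
    best_key, best_score = None, 0.0
    for key, notice in REFERENCE_NOTICES.items():
        score = SequenceMatcher(None, query, tokens(notice)).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:
        return best_key, best_score
    return None, best_score

exact = best_license_match("Licensed under the Apache License, Version 2.0")
approx = best_license_match("licensed under the Apache Licence version 2.0")
unknown = best_license_match("All rights reserved by nobody in particular")
```

The misspelled "Licence" variant still resolves to the Apache notice because six of its seven tokens line up with the reference, while the unrelated sentence falls below the threshold and yields no license.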
27. Alternatives and complements
▷ Open source such as:
○ FOSSology (C, PHP): regex-based
○ Ninka (Perl): regex and sentence-based
○ OSLC (Java, unmaintained)
▷ Commercial
▷ Complementary:
○ AboutCode: document origin side-by-side with code,
collect inventory, generate attribution doc
○ TraceCode (not yet released): trace the source-to-binary
transformation to find (static) linking and which subset
of the source code is actually used (by dynamically
tracing a build or by static analysis)