How to invalidate copyright infringement claims by analyzing the code itself to determine the earliest date it could have been written.


  1. Invalidating Copyright Infringement Claims with Python and Fuzzy Hashing
     Joe T. Sylve, M.S.
     Managing Partner, 504ENSICS Labs
  2. Background
     • Client was being sued for copyright infringement
     • Client’s lawyer wanted two questions answered
       • Does the code contain any open source or GPL code?
       • When was the code in question written?
     • Code was written in PHP (web-based application)
     • Code had absolutely no comments
       • No copyright headers
       • No dates of any kind
     www.504ensics.com
  3. Goal
     • If it can be proven that the code contains open source or GPL code with restrictive licenses, then the claim is invalid
     • If it can be proven that the copyrighted code on file was written after the author’s claimed “creation date”, the copyright is invalid
  4. Is code original?
     • No comments or headers that would imply authorship
     • Code didn’t look familiar
     • Code was kind of crappy
  5. Step 1 – Acquire Samples
     • Wrote Python script to download all projects written in PHP from GitHub
       • Scraped from search feature
       • Limited to 50 pages of search
     • Got something like 10GB of compressed code
       • ~100,000 files
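The slides do not show the download script, so the following is only a rough sketch of the idea. It swaps the scraping of the search pages for GitHub's public REST search API, and every function name in it is made up for illustration. The page size of 20 is an assumption chosen so that 50 pages stay within the roughly 1,000 results the search API is understood to return per query; unauthenticated requests are also rate limited.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_URL = "https://api.github.com/search/repositories"
MAX_PAGES = 50  # the talk limited its crawl to 50 pages of search results


def search_page_url(page, per_page=20):
    """Build the search URL for one page of repositories written in PHP."""
    query = urlencode({"q": "language:php", "page": page, "per_page": per_page})
    return f"{SEARCH_URL}?{query}"


def repo_names(search_response):
    """Extract 'owner/name' strings from one decoded search response."""
    return [item["full_name"] for item in search_response.get("items", [])]


def archive_url(full_name):
    """URL that serves a zip archive of the repository's default branch."""
    return f"https://api.github.com/repos/{full_name}/zipball"


def fetch_json(url):
    """Fetch and decode one JSON document (needs network access)."""
    request = Request(url, headers={"Accept": "application/vnd.github+json"})
    with urlopen(request) as response:
        return json.load(response)


def crawl(max_pages=MAX_PAGES):
    """Yield an archive URL for every repository in the search results."""
    for page in range(1, max_pages + 1):
        for name in repo_names(fetch_json(search_page_url(page))):
            yield archive_url(name)
```

Nothing runs on import; iterating over `crawl()` performs the network requests, and each yielded URL can then be downloaded and unpacked into the sample set.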
  6. Step 2 – Compare Code
     • Three Options
       • Manual Verification
         • Grad Students, Interns, etc.
       • Cryptographic Hashing
         • MD5, SHA-1, etc.
       • “Fuzzy” Hashing
         • ssdeep, sdhash
  7. Fuzzy Hashing
     • Vassil says I have to call it “Approximate Matching”
     • sdhash
       • Vassil Roussev & Candice Quates
       • Free, Open Source
       • Awesome
     • Traditional hashing
       • If a single bit of the input changes, the whole hash changes
     • Fuzzy Hashing
       • Compares files and gives a similarity index
       • Can find “similar” files
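The contrast between the two kinds of hashing can be illustrated with the standard library alone. In this sketch `difflib.SequenceMatcher` is only a stand-in for a real approximate-matching tool such as ssdeep or sdhash; it compares the two inputs directly rather than comparing digests, so it shows the kind of score such tools report, not how they compute it. The two PHP snippets and the helper names are made up for illustration.

```python
import hashlib
from difflib import SequenceMatcher

# Two versions of the same function; only the variable name differs.
original = "<?php function plus_one($x) { return $x + 1; } ?>"
modified = "<?php function plus_one($y) { return $y + 1; } ?>"


def md5_hex(text):
    """Cryptographic digest: any change to the input changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def similarity(a, b):
    """Similarity score from 0 to 100 (stand-in for a fuzzy-hash comparison)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# The digests differ, so exact hashing reports "no match" ...
print("MD5 equal:", md5_hex(original) == md5_hex(modified))
# ... while the similarity score stays high for near-identical code.
print("Similarity:", similarity(original, modified))
```

Against a corpus of roughly 100,000 files, a real tool would compute one digest per file and compare digests, which is much cheaper than pairwise comparison of full file contents.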
  8. When was code written?
     • We can invalidate copyright if the sample on file was written after the claimed authorship date
     • No comments or dates of any kind in the code!
     • No access to developer’s workstation to do traditional forensics
     • ???
  9. PHP
     • Web-based language
     • Updated reasonably frequently
     • New features added often
     • Goal
       • Determine which features were used in the code
       • Correlate features with PHP release date
       • Code couldn’t have been written before this date
  10. Step 1 – Function Use
      • Programmer can create own functions or use ones available in the language
      • Example:
        • function plus_one($x) { return $x + 1; }
      • Python script to find all function declarations and calls
      • Ignore declared functions
      • Left with a list of language “features” used
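The extraction step above can be sketched with two regular expressions. This is not the script from the talk, and the names are made up for illustration. It is also deliberately crude: it does not strip comments or string literals, and it would treat a class name after `new` as a function call, so a production version would want a real PHP tokenizer.

```python
import re

# "function name(" introduces a user-defined function.
DECLARATION = re.compile(r"\bfunction\s+&?\s*([A-Za-z_]\w*)\s*\(")
# "name(" not preceded by $, -> or :: is treated as a plain function call.
CALL = re.compile(r"(?<![\w$>:])([A-Za-z_]\w*)\s*\(")
# Keywords that can be followed by "(" but are not functions.
KEYWORDS = {"if", "elseif", "while", "for", "foreach", "switch", "catch",
            "function", "fn", "return", "echo", "print", "and", "or"}

SAMPLE = r"""<?php
function plus_one($x) { return $x + 1; }
if (strlen($name) > 0) {
    echo json_encode(array_map('trim', $rows));
    echo plus_one(2);
}
"""


def builtin_calls(source):
    """Return called function names that the source does not declare itself."""
    # PHP function names are case-insensitive, so normalise to lower case.
    declared = {name.lower() for name in DECLARATION.findall(source)}
    called = {name.lower() for name in CALL.findall(source)}
    return sorted(called - declared - KEYWORDS)


print(builtin_calls(SAMPLE))
```

For the sample, `plus_one` is dropped because the code declares it, leaving only functions that must come from the language or its extensions.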
  11. Step 2 – Version Detection
      • PHP comes with auto-generated documentation about each built-in function
      • Documentation says in which version each function first became available
      • Write Python script to scrape PHP documentation
      • Correlate functions with PHP versions
      • We only care about the function with the newest version
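Once the version information has been scraped, picking the newest feature is a small comparison problem. The sketch below assumes the scraper has already produced one version string per function, in the parenthesised style the PHP manual shows under each function name; the three strings are hand-copied examples rather than scraper output, and the helper names are hypothetical.

```python
import re


def first_version(version_info):
    """Earliest PHP version named in a manual-style string, as a tuple."""
    # The first comma-separated entry names the oldest supporting branch.
    first_entry = version_info.strip("() ").split(",")[0]
    match = (re.search(r">=\s*(\d+(?:\.\d+)*)", first_entry)
             or re.search(r"PHP\s+(\d+(?:\.\d+)*)", first_entry))
    if match is None:
        raise ValueError(f"unrecognised version string: {version_info!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def newest_feature(version_info_by_function):
    """Return (function, version) for the most recently introduced function."""
    versions = {name: first_version(info)
                for name, info in version_info_by_function.items()}
    name = max(versions, key=versions.get)
    return name, versions[name]


# Hand-copied examples, assumed to match the manual's format.
scraped = {
    "strlen": "(PHP 4, PHP 5, PHP 7, PHP 8)",
    "json_encode": "(PHP 5 >= 5.2.0, PHP 7, PHP 8, PECL json >= 1.2.0)",
    "array_column": "(PHP 5 >= 5.5.0, PHP 7, PHP 8)",
}
print(newest_feature(scraped))
```

Versions are compared as integer tuples rather than strings so that, for example, 5.10.0 would sort after 5.9.0.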
  12. Step 3 – Date the code
      • PHP has an archive of release notes on their website
        • Contains release versions and dates
      • Python script scrapes release notes for the PHP version of interest and gives us the release date
      • Reasonably, the code couldn’t have been written before that date
  13. Step 4 – Profit
      • Win!
      • Code in question used features first available in PHP 5.1.5
        • Release date 17-Aug-2006
      • This was after the claimed creation date
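The final comparison is simple date arithmetic. In this sketch the release date is the one given on the slide, while the claimed creation date is a hypothetical placeholder, since the slides do not state the real one; the function names are also made up for illustration.

```python
from datetime import date, datetime


def parse_release_date(text):
    """Parse a date written like '17-Aug-2006' (English month abbreviation)."""
    return datetime.strptime(text, "%d-%b-%Y").date()


def contradicts_claim(release_date, claimed_creation_date):
    """True if the code relies on a release newer than its claimed creation."""
    return claimed_creation_date < release_date


release = parse_release_date("17-Aug-2006")  # PHP 5.1.5, per the slide
claimed = date(2006, 1, 1)  # hypothetical; the real claimed date is not given
print(release.isoformat(), contradicts_claim(release, claimed))
```

The release date is only a lower bound: it shows the code could not reasonably predate that release, not when it was actually written.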
  14. Conclusion
      • Sometimes you can’t depend solely on existing tools
      • Learn to program even if you’re not a “programmer”
      • PHP sucks
      • Fuzzy hashing and Python are cool
