Ctcompare: Comparing Multiple Code Trees for Similarity

Warren Toomey
School of IT, Bond University

Using lexical analysis with techniques borrowed from DNA sequencing, multiple code trees can be quickly compared to find any code similarities

Why Write a Code Comparison Tool?
● To detect student plagiarism
● To determine if your codebase is “infected” by proprietary code from elsewhere, or by open-source code covered by a license like the GPL
● To trace code genealogy between trees separated by time (e.g. versions), useful for the new field of computing history

Code Comparison Issues
● Can rearrangement of code be detected?
  - per line? per sub-line?
● Can “munging of code” be detected?
  - variable/function/struct renaming?
● What if one or both codebases are proprietary? How can third parties verify any comparison?
● Can a timely comparison be done?
● What is the rate of false positives?
  - of missed matches?

Lexical Comparison
● First version: break the source code from each tree into lexical tokens, then compare runs of tokens between trees.
● This removes all the code's semantics, but does deal with code rearrangement (but not code “munging”).
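
To make the idea of "lexical tokens" concrete, here is a minimal sketch of a token counter for fragments of C. It is not ctcompare's lexer: the function name count_tokens and the short list of two-character operators are assumptions made for illustration, and strings, character literals, comments and preprocessor lines are deliberately ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Two-character operators known to this sketch (hypothetical, not the full C set). */
static const char *two_char_ops[] = {
    "++", "--", "==", "!=", "<=", ">=", "&&", "||", "->", "+=", "-=", "<<", ">>"
};

/* Return 1 if the text at s starts with one of the two-character operators. */
static int is_two_char_op(const char *s)
{
    size_t i;
    for (i = 0; i < sizeof two_char_ops / sizeof two_char_ops[0]; i++)
        if (strncmp(s, two_char_ops[i], 2) == 0)
            return 1;
    return 0;
}

/* Count the lexical tokens in a fragment of C source. */
size_t count_tokens(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        unsigned char c = (unsigned char)*s;
        if (isspace(c)) {               /* whitespace separates tokens */
            s++;
            continue;
        }
        if (isalpha(c) || c == '_') {   /* identifier or keyword */
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;
        } else if (isdigit(c)) {        /* simple numeric literal */
            while (isalnum((unsigned char)*s))
                s++;
        } else if (is_two_char_op(s)) { /* operator such as ++ or && */
            s += 2;
        } else {                        /* single-character punctuation */
            s++;
        }
        n++;
    }
    return n;
}
```

With these rules, a fragment such as `if (x > max) max=x;` splits into 10 tokens: if ( x > max ) max = x ;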

Performance of Lexical Approach
● Performance was O(M*N), where M and N are the number of tokens in each code tree
● Basically: slow, with the cost growing quadratically as the trees grow
● I also wanted to compare multiple code trees, and also find duplicated code within a tree
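
The M*N cost can be seen in a sketch of a pairwise comparison. This is only one way such a comparison could be written, not the first version's actual code; tokens are assumed to be already reduced to integer codes, and longest_common_run is a hypothetical name.

```c
#include <stddef.h>

/* Length of the longest run of tokens that appears in both a[] and b[].
 * Every start position in a[] is tried against every start position in
 * b[], so the two outer loops alone visit M*N position pairs. */
size_t longest_common_run(const int *a, size_t m, const int *b, size_t n)
{
    size_t best = 0, i, j;
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            size_t k = 0;
            /* Extend the match for as long as the tokens agree. */
            while (i + k < m && j + k < n && a[i + k] == b[j + k])
                k++;
            if (k > best)
                best = k;
        }
    }
    return best;
}
```

Doubling the size of both trees quadruples the number of position pairs, which is why this approach does not scale to many large trees.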

When is Copying Significant?
● Common code fragments not significant:
     for (i=0; i< val; i++)             13 tokens
     if (x > max) max=x;                10 tokens
     err= stat(file, &sb); if (err)     14 tokens
● Axiom: runs of < 16 tokens insignificant
● Idea: use a run of 16 tokens as a lookup key to find other trees with the same run of tokens

Approach in Second Version
● Take all 16-token runs in a code tree
● Use each as a key, and insert the runs into a database along with the position of the run in the source tree
● Keys (i.e. runs of 16 tokens) with multiple values indicate possible code copying of at least 16 tokens
● For these keys, we can evaluate all code trees from this point on to find any real code similarity
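
The keying step can be sketched as follows. This sketch swaps the database for a sorted in-memory array, so that equal keys end up next to each other; that substitution, the struct run_ref layout and the function names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

#define RUN_LEN 16   /* key length in tokens, as on the slide */

/* One 16-token window: where it starts and which tree it came from. */
struct run_ref {
    const int *tokens;   /* first token of the window */
    int tree;            /* hypothetical tree identifier */
    size_t pos;          /* token offset within that tree */
};

/* Order two windows by their token contents. */
static int cmp_run(const void *pa, const void *pb)
{
    const struct run_ref *a = pa, *b = pb;
    size_t k;
    for (k = 0; k < RUN_LEN; k++) {
        if (a->tokens[k] != b->tokens[k])
            return a->tokens[k] < b->tokens[k] ? -1 : 1;
    }
    return 0;
}

/* Append every RUN_LEN-token window of one tree to out[].
 * The caller must provide room for n - RUN_LEN + 1 entries.
 * Returns the number of windows added. */
size_t add_runs(struct run_ref *out, const int *toks, size_t n, int tree)
{
    size_t i, count = 0;
    for (i = 0; i + RUN_LEN <= n; i++) {
        out[count].tokens = toks + i;
        out[count].tree = tree;
        out[count].pos = i;
        count++;
    }
    return count;
}

/* Sort the windows, then count the keys that occur more than once.
 * Each such key marks a candidate for a fuller comparison. */
size_t count_shared_keys(struct run_ref *runs, size_t n)
{
    size_t i = 0, shared = 0;
    qsort(runs, n, sizeof runs[0], cmp_run);
    while (i < n) {
        size_t j = i + 1;
        while (j < n && cmp_run(&runs[i], &runs[j]) == 0)
            j++;
        if (j - i > 1)
            shared++;
        i = j;
    }
    return shared;
}
```

Sorting costs O(K log K) for K windows in total, which avoids comparing every position in one tree against every position in another.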

Approach in Second Version: Algorithm
for each key in the database with multiple record nodes {
  obtain the set of record nodes from the database;

  for each combination of node pairs Na and Nb {
    if (performing a cross-tree comparison)
      skip this combination if Na and Nb in the same tree;
    if (performing a code clone comparison)
      skip this combination if Na and Nb in the same file;

    perform a full token comparison beginning at Na and Nb to
    determine the full extent of the code similarity, i.e. determine
    the actual number of tokens in common;

    add the matching token sequence to a list of "potential runs";
  }
}
walk the list of potential runs to merge overlapping runs, and
remove runs which are subsets of larger runs;

output the details of the actual matching token runs found.
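
The final merging step of the algorithm might look like the sketch below for the runs found between one pair of files. It is not ctcompare's implementation: the struct run layout and merge_runs are hypothetical, and it uses a sort rather than a tree structure. It relies on one observation: two runs describe the same stretch of matching code only when the offset between their two start positions is the same.

```c
#include <stddef.h>
#include <stdlib.h>

/* A matching run between two files (hypothetical layout). */
struct run {
    long a_pos;   /* start offset in the first file, in tokens */
    long b_pos;   /* start offset in the second file, in tokens */
    long len;     /* number of matching tokens */
};

/* Order runs by offset difference, then by start position. */
static int cmp_diag(const void *pa, const void *pb)
{
    const struct run *x = pa, *y = pb;
    long dx = x->a_pos - x->b_pos, dy = y->a_pos - y->b_pos;
    if (dx != dy)
        return dx < dy ? -1 : 1;
    if (x->a_pos != y->a_pos)
        return x->a_pos < y->a_pos ? -1 : 1;
    return 0;
}

/* Merge overlapping runs in place and drop runs contained in larger
 * ones. Returns the number of runs left at the front of r[]. */
size_t merge_runs(struct run *r, size_t n)
{
    size_t i, out = 0;
    if (n == 0)
        return 0;
    qsort(r, n, sizeof r[0], cmp_diag);
    for (i = 1; i < n; i++) {
        struct run *last = &r[out];
        int same_diag =
            (last->a_pos - last->b_pos) == (r[i].a_pos - r[i].b_pos);
        if (same_diag && r[i].a_pos <= last->a_pos + last->len) {
            /* Overlapping or touching: extend the kept run if needed.
             * A run that ends earlier is a subset and simply vanishes. */
            long end = r[i].a_pos + r[i].len;
            if (end > last->a_pos + last->len)
                last->len = end - last->a_pos;
        } else {
            r[++out] = r[i];   /* unrelated run: keep it separately */
        }
    }
    return out + 1;
}
```

For example, runs starting at (10, 110) with length 16 and at (12, 112) with length 20 share the same offset difference and overlap, so they merge into a single run of 22 tokens starting at (10, 110).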

Hashing Identifiers and Literals in Each Code Tree
● Simply comparing tokens yields false positives, e.g. these two lines have identical token types:
      if (xx > yy) { aa = 2 * bb + cc; }
      if (ab > bc) {ab = 5 * ab + bc; }
● I hash each identifier and literal down to a 16-bit value
● This obfuscates the original values while minimising the number of false positives
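
The slides do not say which hash function is used, so the sketch below is purely an assumption: it takes the well-known 32-bit FNV-1a hash and folds the result down to 16 bits. The name hash16 is hypothetical.

```c
#include <stdint.h>

/* Reduce an identifier or literal to a 16-bit value.
 * Assumption: 32-bit FNV-1a, XOR-folded to 16 bits. The hash that
 * ctcompare actually uses is not given in the slides. */
uint16_t hash16(const char *s)
{
    uint32_t h = 2166136261u;       /* FNV-1a 32-bit offset basis */
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }
    /* Fold the high half into the low half to keep 16 bits. */
    return (uint16_t)((h >> 16) ^ (h & 0xFFFFu));
}
```

Equal names always give equal values, so copied code still matches. With only 65,536 possible values, two different names can occasionally collide; that is the price of a compact key that also hides the original text.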

Isomorphic Comparison
● Hashing identifiers helps to ensure code similarities are found
● Does not help if variables have been renamed
● This can be solved with isomorphic comparison: keep a 2-way variable isomorphism table
● Code is isomorphic when variable names are isomorphic across a long run of code

Isomorphic Example
int maxofthree(int x, int y, int z)
 {
  if ((x>y) && (x>z)) return(x);
  if (y>z) return(y);
  return(z);
 }

int bigtriple(int b, int a, int c)
 {
  if ((b>a) && (b>c)) return(b);
  if (a>c) return(a);
  return(c);
 }
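
A 2-way table for the two functions above can be sketched like this. The function is_isomorphic, the MAX_NAMES limit and the use of plain strings (rather than hashed values) are assumptions for illustration. The check walks both identifier sequences in step and fails as soon as a name on either side is already paired with a different partner.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 64   /* hypothetical table capacity */

/* Return 1 if the identifiers in a[] and b[] correspond one-to-one in
 * both directions, e.g. x<->b, y<->a, z<->c; return 0 otherwise. */
int is_isomorphic(const char *const a[], const char *const b[], size_t n)
{
    const char *from[MAX_NAMES], *to[MAX_NAMES];
    size_t used = 0, i, k;

    for (i = 0; i < n; i++) {
        int found = 0;
        for (k = 0; k < used; k++) {
            int a_same = strcmp(from[k], a[i]) == 0;
            int b_same = strcmp(to[k], b[i]) == 0;
            if (a_same != b_same)
                return 0;   /* one side is already paired with another name */
            if (a_same) {
                found = 1;  /* this pairing has been seen before */
                break;
            }
        }
        if (!found) {
            if (used == MAX_NAMES)
                return 0;   /* table full: give up in this sketch */
            from[used] = a[i];
            to[used] = b[i];
            used++;
        }
    }
    return 1;
}
```

Fed the identifier sequences of maxofthree (x y z x y x z x y z y z) and bigtriple (b a c b a b c b a c a c), the table settles on x<->b, y<->a, z<->c and never sees a conflict, so the two functions count as isomorphic.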

Bottlenecks
● In the second version, the main bottleneck is the merging of overlapping code fragments which have been found to match between two trees
● Similar to the problem with DNA sequencing when genes are split up into multiple fragments before sequencing
● Use of AVL trees here has helped to reduce the cost of merging immensely
● Other more mundane optimisations such as mmap() have also helped

Current Performance
● 54 seconds to compare 14 trees of code totalling 1 million lines of C code

Results
● Many lines of code written at UNSW in the late 1970s are still present in System V, and in OpenSolaris
● Thousands of lines of replicated code inside the Linux kernel
● AT&T vs BSDi lawsuit in the early 1990s:
   - expert witness found 56 common lines in 230K lines between BSD and Unix
   - my tool finds roughly the same lines, plus 100+ more isomorphic similarities

Questions?

Ctcompare: Comparing Multiple Code Trees for Similarity

  • 1. Ctcompare: Comparing Multiple Code Trees for Similarity
      Warren Toomey, School of IT, Bond University
      Using lexical analysis with techniques borrowed from DNA sequencing, multiple code trees can be quickly compared to find any code similarities
  • 2. Why Write a Code Comparison Tool?
      ● To detect student plagiarism
      ● To determine if your codebase is `infected' by proprietary code from elsewhere, or by open-source code covered by a license like the GPL
      ● To trace code genealogy between trees separated by time (e.g. versions), useful for the new field of computing history
  • 3. Code Comparison Issues
      ● Can rearrangement of code be detected?
        - per line? per sub-line?
      ● Can “munging of code” be detected?
        - variable/function/struct renaming?
      ● What if one or both codebases are proprietary? How can third parties verify any comparison?
      ● Can a timely comparison be done?
      ● What is the rate of false positives?
        - of missed matches?
  • 4. Lexical Comparison
      ● First version: break the source code from each tree into lexical tokens, then compare runs of tokens between trees.
      ● This removes all the code's semantics, but does deal with code rearrangement (but not code “munging”).
  • 5. Performance of Lexical Approach
      ● Performance was O(M*N), where M and N are the numbers of tokens in each code tree
      ● Basically: slow, with cost growing quadratically as the trees get larger
      ● I also wanted to compare multiple code trees, and also find duplicated code within a tree
  • 6. When is Copying Significant?
      ● Common code fragments not significant:
          for (i=0; i< val; i++)            13 tokens
          if (x > max) max=x;               10 tokens
          err= stat(file, &sb); if (err)    14 tokens
      ● Axiom: runs of < 16 tokens insignificant
      ● Idea: use a run of 16 tokens as a lookup key to find other trees with the same run of tokens
  • 7. Approach in Second Version
      ● Take all 16-token runs in a code tree
      ● Use each as a key, and insert the runs into a database along with the position of the run in the source tree
      ● Keys (i.e. runs of 16 tokens) with multiple values indicate possible code copying of at least 16 tokens
      ● For these keys, we can evaluate all code trees from this point on to find any real code similarity
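The keyed lookup on slide 7 can be sketched in C roughly as follows. This is a minimal illustration, not the tool's actual code: the `token_t` type, the `run_rec` record, the function names, and the use of a sorted in-memory array in place of a real database are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RUN_LEN 16               /* minimum significant run, per the slides */

typedef unsigned short token_t;  /* assumption: one small integer per token */

/* Hypothetical "database record": where one 16-token run starts. */
struct run_rec {
    const token_t *key;          /* points at RUN_LEN tokens in the stream */
    int tree;                    /* hypothetical tree id */
    size_t pos;                  /* token offset within that tree */
};

/* Byte-wise order over the 16-token key; any total order works for grouping. */
static int cmp_rec(const void *a, const void *b)
{
    const struct run_rec *ra = a, *rb = b;
    return memcmp(ra->key, rb->key, RUN_LEN * sizeof(token_t));
}

/* Append every RUN_LEN-token window of one tree; returns the new count. */
size_t add_runs(struct run_rec *recs, size_t n, size_t cap, int tree,
                const token_t *toks, size_t ntoks)
{
    for (size_t i = 0; i + RUN_LEN <= ntoks; i++) {
        assert(n < cap);         /* caller must provide enough room */
        recs[n].key = toks + i;
        recs[n].tree = tree;
        recs[n].pos = i;
        n++;
    }
    return n;
}

/* Sort by key, then count the keys that have more than one record. */
size_t count_shared_keys(struct run_rec *recs, size_t n)
{
    size_t shared = 0;
    qsort(recs, n, sizeof recs[0], cmp_rec);
    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && cmp_rec(&recs[i], &recs[j]) == 0)
            j++;
        if (j - i > 1)
            shared++;            /* candidate for a full comparison */
        i = j;
    }
    return shared;
}
```

After sorting, records with equal keys sit next to each other, so each group of two or more records plays the role of a "key with multiple values" from the slide.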
  • 9. Algorithm
        for each key in the database with multiple record nodes {
          obtain the set of record nodes from the database;
          for each combination of node pairs Na and Nb {
            if (performing a cross-tree comparison)
              skip this combination if Na and Nb in the same tree;
            if (performing a code clone comparison)
              skip this combination if Na and Nb in the same file;
            perform a full token comparison beginning at Na and Nb to
              determine the full extent of the code similarity, i.e.
              determine the actual number of tokens in common;
            add the matching token sequence to a list of "potential runs";
          }
        }
        walk the list of potential runs to merge overlapping runs, and
          remove runs which are subsets of larger runs;
        output the details of the actual matching token runs found.
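The two inner steps of the algorithm above, deciding whether to skip a node pair and then measuring how far a match extends, might look like this in C. The `node` struct, the `mode` enum, and both function names are hypothetical, and tokens are assumed to be plain small integers.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short token_t;  /* assumption: one small integer per token */

enum mode { CROSS_TREE, CODE_CLONE };

/* Hypothetical record node: which tree and file a run came from. */
struct node {
    int tree;
    int file;                    /* file id, assumed unique within a tree */
    size_t pos;
};

/* Returns 1 if the pair should be skipped for the given comparison mode. */
int skip_pair(enum mode m, const struct node *x, const struct node *y)
{
    if (m == CROSS_TREE)
        return x->tree == y->tree;
    return x->tree == y->tree && x->file == y->file;
}

/* Count the tokens in common starting at a[pa] and b[pb]. */
size_t match_length(const token_t *a, size_t na, size_t pa,
                    const token_t *b, size_t nb, size_t pb)
{
    size_t len = 0;
    assert(pa <= na && pb <= nb);
    while (pa + len < na && pb + len < nb && a[pa + len] == b[pb + len])
        len++;
    return len;
}
```

Because every candidate pair already shares a 16-token key, `match_length` would return at least 16 for such pairs; the loop only determines how much further the similarity runs.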
  • 10. Hashing Identifiers and Literals in Each Code Tree
      ● Simply comparing tokens yields false positives, e.g.
          if (xx > yy) { aa = 2 * bb + cc; }
          if (ab > bc) { ab = 5 * ab + bc; }
      ● I hash each identifier and literal down to a 16-bit value
      ● This obfuscates the original values while minimising the number of false positives
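The slides do not say which hash function is used, so the sketch below substitutes 32-bit FNV-1a folded down to 16 bits purely as an illustration; the function name `hash16` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hash an identifier or literal to a 16-bit value.
 * Assumption: FNV-1a (32-bit) with an XOR fold; the original tool may
 * use a different function.
 */
uint16_t hash16(const char *s)
{
    uint32_t h = 2166136261u;            /* FNV-1a offset basis */
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                  /* FNV prime */
    }
    return (uint16_t)((h >> 16) ^ (h & 0xffffu));
}
```

Storing only the 16-bit value alongside the token type means `xx` and `ab` in the example above normally compare as different tokens, while the original names are not directly visible in the stored data. Distinct names can still collide on the same 16-bit value, which is one remaining source of false positives.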
  • 11. Isomorphic Comparison
      ● Hashing identifiers helps to ensure code similarities are found
      ● Does not help if variables have been renamed
      ● This can be solved with isomorphic comparison: keep a 2-way variable isomorphism table
      ● Code is isomorphic when variable names are isomorphic across a long run of code
  • 12. Isomorphic Example
        int maxofthree(int x, int y, int z)
        {
          if ((x>y) && (x>z)) return(x);
          if (y>z) return(y);
          return(z);
        }

        int bigtriple(int b, int a, int c)
        {
          if ((b>a) && (b>c)) return(b);
          if (a>c) return(a);
          return(c);
        }
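A 2-way isomorphism table of the kind described on slide 11 can be sketched as two lookup arrays, one per direction. The sketch assumes each identifier has already been reduced to a small integer id below `MAX_ID`; that encoding, the constant, and the function name are assumptions for the example only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ID 256               /* assumption: identifier ids are 0..255 */
#define UNSET  (-1)

/*
 * Returns 1 if the identifier sequences a[0..n) and b[0..n) are related
 * by a consistent one-to-one renaming, 0 otherwise.
 */
int isomorphic_run(const int *a, const int *b, size_t n)
{
    int a2b[MAX_ID], b2a[MAX_ID];

    for (int i = 0; i < MAX_ID; i++)
        a2b[i] = b2a[i] = UNSET;

    for (size_t i = 0; i < n; i++) {
        int x = a[i], y = b[i];
        assert(x >= 0 && x < MAX_ID && y >= 0 && y < MAX_ID);
        if (a2b[x] == UNSET && b2a[y] == UNSET) {
            a2b[x] = y;          /* first sighting: record both directions */
            b2a[y] = x;
        } else if (a2b[x] != y || b2a[y] != x) {
            return 0;            /* renaming is inconsistent */
        }
    }
    return 1;
}
```

For the example on slide 12, numbering `x, y, z` as 0, 1, 2 and `a, b, c` as 0, 1, 2 gives the variable sequences `0 1 2 0 1 0 2 0 1 2 1 2` and `1 0 2 1 0 1 2 1 0 2 0 2`, which the function accepts under the mapping x↔b, y↔a, z↔c. Checking both directions is what rejects the case where two different names on one side map to a single name on the other.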
  • 13. Bottlenecks
      ● In the second version, the main bottleneck is the merging of overlapping code fragments which have been found to match between two trees
      ● Similar to the problem with DNA sequencing when genes are split up into multiple fragments before sequencing
      ● Use of AVL trees here has helped to reduce the cost of merging immensely
      ● Other more mundane optimisations such as mmap() have also helped
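To make the merging step concrete, here is a simplified sketch that uses a sort followed by a single sweep rather than the AVL trees mentioned on the slide. The `run` struct and the function names are hypothetical. Two runs are treated as mergeable when they lie on the same "diagonal" (the same offset between the two trees) and their token ranges overlap or touch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical matching run: starts at token a in tree A and b in tree B. */
struct run {
    long a;
    long b;
    long len;
};

/* Order by diagonal (a - b), then by start position in tree A. */
static int cmp_run(const void *p, const void *q)
{
    const struct run *x = p, *y = q;
    long dx = x->a - x->b, dy = y->a - y->b;
    if (dx != dy)
        return dx < dy ? -1 : 1;
    if (x->a != y->a)
        return x->a < y->a ? -1 : 1;
    return 0;
}

/* Merge overlapping runs in place; returns the number of runs kept. */
size_t merge_runs(struct run *r, size_t n)
{
    size_t out = 0;
    assert(r != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(r, n, sizeof r[0], cmp_run);
    for (size_t i = 1; i < n; i++) {
        struct run *cur = &r[out];
        int same_diag = (r[i].a - r[i].b) == (cur->a - cur->b);
        if (same_diag && r[i].a <= cur->a + cur->len) {
            long end = r[i].a + r[i].len;
            if (end > cur->a + cur->len)
                cur->len = end - cur->a;   /* extend the current run */
            /* otherwise r[i] is a subset of cur and is dropped */
        } else {
            r[++out] = r[i];               /* start a new run */
        }
    }
    return out + 1;
}
```

This batch approach sorts everything once; a balanced tree such as an AVL tree would instead allow runs to be merged incrementally as they are found.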
  • 14. Current Performance
      ● 54 seconds to compare 14 trees of code totalling 1 million lines of C code
  • 15. Results
      ● Many lines of code written at UNSW in the late 1970s are still present in System V, and in Open Solaris
      ● Thousands of lines of replicated code inside the Linux kernel
      ● AT&T vs BSDi lawsuit in early 1990s:
        - expert witness found 56 common lines in 230K lines between BSD and Unix
        - my tool finds roughly the same lines, plus 100+ more isomorphic similarities