FooCodeChu
Services for software analysis, malware detection, and vulnerability research

Silvio Cesare <silvio.cesare@gmail.com>
Who am I and why this talk?
• Ph.D. Student at Deakin University

• Book Author

• This talk covers some of my publicly accessible
  Ph.D. research.
Introduction
• Research on software analysis, similarity, and
  classification
 ▫   Malware detection and attribution
 ▫   Incident response
 ▫   Plagiarism detection
 ▫   Software theft detection
 ▫   Vulnerability research

• Three academic research tools free to use on my
  website.
Outline
• Simseer

• Clonewise

• Bugwise

• Future Work and Conclusion
Software similarity and visualisation
Motivation
• Many applications of software similarity
  ▫ Malware detection
  ▫ Plagiarism detection
  ▫ Software theft detection

• Traditional string signatures are ineffective

• Modern fingerprints are effective but in many cases
  inefficient
Program Representation
    lea     0x4(%esp),%ecx
    and     $0xfffffff0,%esp
    pushl   -0x4(%ecx)
    push    %ebp
    mov     %esp,%ebp
    push    %ecx
    sub     $0x24,%esp
    call    4011b0 <___main>
    movl    $0x0,-0x8(%ebp)
    jmp     40115f <_main+0x2f>

    movl    $0x4020a0,(%esp)
    call    4011b8 <_puts>
    addl    $0x1,-0x8(%ebp)

    cmpl    $0x9,-0x8(%ebp)
    jle     40114f <_main+0x1f>

    add     $0x24,%esp
    pop     %ecx
    pop     %ebp
    lea     -0x4(%ecx),%esp
    ret

[Figure: the basic blocks above drawn as a control flow graph, next to a call graph of procedures Proc_0 to Proc_4]
Simseer Program Fingerprint
• Set of control flow graphs
• Many procedures
Decompilation of a Control Flow Graph
[Figure: control flow graph with nodes L_0 to L_7 and edges labelled "true"]

proc(){
L_0:
  while (v1 || v2) {
L_1:
    if (v3) {
L_2:
    } else {
L_4:
    }
L_5:
  }
L_7:
  return;
}

Decompiled string: W|IEH}R
Q-Grams
• Input is decompiled strings

• Extract all possible fixed size substrings (q-grams)

• Train 500 dominant q-grams

  W|IEH}R  ->  W|IE, |IEH, IEH}, EH}R
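
A minimal sketch of q-gram extraction, assuming q = 4 as in the W|IEH}R example above. The function name `qgrams` is hypothetical and is not Simseer's API.

```python
def qgrams(s, q=4):
    """Return every contiguous substring of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# The decompiled string from the slide yields four 4-grams.
print(qgrams("W|IEH}R"))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than q yields no q-grams.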
Program Similarity
• 500 q-grams make a "feature vector"

• Similarity using vector distance
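
A rough sketch of the idea, under stated assumptions: each feature is the count of one q-gram from a fixed vocabulary, and the distance is Euclidean. The slides do not say which distance Simseer uses, so Euclidean is only a stand-in, and every name below is hypothetical. The four-entry vocabulary stands in for the 500 trained q-grams.

```python
import math
from collections import Counter

def qgram_counts(s, q=4):
    """Count each q-gram occurring in one decompiled string."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def feature_vector(strings, vocabulary, q=4):
    """One count per vocabulary q-gram, summed over a program's strings."""
    total = Counter()
    for s in strings:
        total.update(qgram_counts(s, q))
    return [total[g] for g in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocabulary = ["W|IE", "|IEH", "IEH}", "EH}R"]  # stand-in for 500 trained q-grams
p = feature_vector(["W|IEH}R"], vocabulary)
q_vec = feature_vector(["W|IEH}", "W|IEH}R"], vocabulary)
print(p, q_vec, euclidean(p, q_vec))
```

A smaller distance means more similar programs; how small counts as "similar" is a threshold the slides do not specify.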
Software similarity search
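[Figure: programs as points p, q, and r in feature space, with distance(p,q) marked between p and q; panels labelled Query Benign, Query Malicious, Query, and Malware]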
Future Work
• Give access to more classes of program
  "fingerprints"
 ▫ Call graphs
 ▫ Opcodes
 ▫ Different similarity measures
Simseer summary
• Simseer is effective

• Efficient

• Web service is free for public use
Detecting package clones and inferring security problems
Motivation
• Developers may “embed” or “clone” software
  from 3rd party sources
 ▫ Maintaining an internal copy of a library
 ▫ Forking a library

• Clonewise detects if two packages share code
• And if one package is entirely embedded in
  another.
[Figure: Firefox Vulnerabilities and libpng Vulnerabilities]
Feature Extraction – Shared package clone detection
                1.    N_Filenames_A
                2.    N_Filenames_Source_A
                3.    N_Filenames_B
                4.    N_Filenames_Source_B
                5.    N_Common_Filenames
                6.    N_Common_Similar_Filenames
                7.    N_Common_FilenameHashes
                8.    N_Common_FilenameHash80
                9.    N_Common_ExactFilenameHash
                10.   N_Score_of_Common_Filename
                11.   N_Score_of_Common_Similar_Filename
                12.   N_Score_of_Common_FilenameHash
                13.   N_Score_of_Common_FilenameHash80
                14.   N_Score_of_Common_ExactFilenameHash80
                15.   N_Data_Common_Filenames
                16.   N_Data_Common_Similar_Filenames
                17.   N_Data_Common_FilenameHashes
                18.   N_Data_Common_FilenameHash80
                19.   N_Data_Common_ExactFilenameHash
                20.   N_Data_Score_of_Common_Filename
                21.   N_Data_Score_of_Common_Similar_Filename
                22.   N_Data_Score_of_Common_FilenameHash
                23.   N_Data_Score_of_Common_FilenameHash80
                24.   N_Data_Score_of_Common_ExactFilenameHash80
                25.   N_Common_ExactHash
                26.   N_Common_DataExactHash
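
A small sketch of how a few of these features might be computed from two packages' file listings. The interpretation of each feature name is an assumption drawn from the name alone, SHA-256 is a stand-in for whatever hash Clonewise uses, and the function and sample data are hypothetical.

```python
import hashlib

def shared_clone_features(pkg_a, pkg_b):
    """pkg_a, pkg_b: dicts mapping file name -> file contents (bytes)."""
    names_a, names_b = set(pkg_a), set(pkg_b)
    # Assumed meaning of "ExactHash": identical file contents.
    hashes_a = {hashlib.sha256(c).hexdigest() for c in pkg_a.values()}
    hashes_b = {hashlib.sha256(c).hexdigest() for c in pkg_b.values()}
    return {
        "N_Filenames_A": len(names_a),
        "N_Filenames_B": len(names_b),
        "N_Common_Filenames": len(names_a & names_b),
        "N_Common_ExactHash": len(hashes_a & hashes_b),
    }

# Made-up packages sharing one identical file.
a = {"png.c": b"int png;", "main.c": b"int main;"}
b = {"png.c": b"int png;", "util.c": b"int util;"}
features = shared_clone_features(a, b)
print(features)
```

The resulting dictionary would form part of the feature vector handed to a classifier.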
Classification
• Consider feature vectors as n-dimensional
  points in space.

• Linear classifiers

• Non-linear classifiers

• Decision trees
[Figure: points in feature space separated into Class A and Class B]
Feature Extraction – Embedded clone detection
               1. N_Filenames_A
               2. N_Filenames_Source_A
               3. N_Filenames_B
               4. N_Filenames_Source_B
               5. Percent_Match_In_A
               6. Percent_Data_Match_In_A
               7. Percent_Match_In_B
               8. Percent_Data_Match_In_B
               9. Percent_Score_In_A
               10. Percent_Data_Score_In_A
               11. Percent_Score_In_B
               12. Percent_Data_Score_In_B
               13. A_Has_Lib_In_Name
               14. B_Has_Lib_In_Name
               15. A_To_B_Ratio
               16. A_To_B_Data_Ratio
               17. N_Dependents_A
               18. N_Dependents_B
Detecting copyright violations

1. Identify embedded package clones.
2. Extract license information of each package.
3. For each GPL licensed embedded package
   clone:
  ▫ Verify that the package it is embedded in is
     not licensing it under a permissive license.
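
The three steps above can be sketched as follows. This is only an illustration: the package names are invented, the set of "permissive" licenses is an illustrative assumption, and real license metadata is far messier than a single string per package.

```python
PERMISSIVE = {"MIT", "BSD-3-Clause", "Apache-2.0"}  # illustrative, not exhaustive

def possible_violations(embedded_clones, licenses):
    """embedded_clones: iterable of (host_package, embedded_package) pairs.
    licenses: dict mapping package name -> license identifier."""
    flagged = []
    for host, embedded in embedded_clones:
        embedded_is_gpl = licenses.get(embedded, "").startswith("GPL")
        host_is_permissive = licenses.get(host) in PERMISSIVE
        if embedded_is_gpl and host_is_permissive:
            flagged.append((host, embedded))
    return flagged

# Hypothetical data: one GPL library embedded in an MIT-licensed host.
clones = [("hostapp", "gpl-lib"), ("otherapp", "bsd-lib")]
licenses = {"hostapp": "MIT", "gpl-lib": "GPL-2.0",
            "otherapp": "MIT", "bsd-lib": "BSD-3-Clause"}
print(possible_violations(clones, licenses))
```

A flagged pair is only a candidate for manual review, not proof of a violation.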
Automated Vulnerability Inference
1. Take CVE, match CPE name to Debian package.

2. Parse CVE summary and extract vuln filename.

3. Find clones of package with similar filename.

4. Trim dynamically linked clones.

5. Is vuln affected clone already being tracked?
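
Steps 2 to 4 might look roughly like the sketch below. The regular expression, the function names, the summary text, and the package data are all hypothetical; the summary is not a real CVE. For simplicity it matches file names exactly, whereas the step above asks for similar file names.

```python
import re

# Assumed pattern: a path-like token ending in a common source extension.
FILENAME_RE = re.compile(r"\b[\w./-]+\.(?:c|cc|cpp|h|py)\b")

def vulnerable_filenames(summary):
    """Step 2: pull candidate source file names out of a CVE summary."""
    return FILENAME_RE.findall(summary)

def candidate_clones(vuln_file, package_files, dynamically_linked):
    """Steps 3-4: packages shipping vuln_file, minus dynamically linked ones."""
    return sorted(
        pkg for pkg, files in package_files.items()
        if vuln_file in files and pkg not in dynamically_linked
    )

summary = ("Buffer overflow in foo_parse.c in the example library allows "
           "remote attackers to crash the application.")
package_files = {
    "app-one": {"foo_parse.c", "main.c"},
    "app-two": {"foo_parse.c"},
    "app-three": {"other.c"},
}
names = vulnerable_filenames(summary)
print(names, candidate_clones(names[0], package_files, {"app-two"}))
```

Step 5 would then compare the surviving candidates against whatever the security team already tracks.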
Package clone detection use-case
Finding Vulnerabilities
Shared package clone evaluation

 Classifier              TP/FN     FP/TN       TP Rate   FP Rate

 Naïve Bayes             439/322   484/56296   57.69%    0.85%
 Multilayer Perceptron   204/557   48/56732    26.81%    0.08%
 C4.5                    523/238   86/56694    68.73%    0.15%
 Random Forest           533/228   60/56720    70.04%    0.11%
 Random Forest (0.8)     446/315   15/56765    58.61%    0.03%
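
For reference, the rate columns follow from the counts using the standard definitions TP Rate = TP / (TP + FN) and FP Rate = FP / (FP + TN). A short check against the Random Forest row (the helper name `rates` is made up for this example):

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from raw counts."""
    return tp / (tp + fn), fp / (fp + tn)

tp_rate, fp_rate = rates(tp=533, fn=228, fp=60, tn=56720)
print(f"{tp_rate:.2%} {fp_rate:.2%}")  # 70.04% 0.11%
```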
Embedded clone detection evaluation

 Classifier              TP/FN     FP/TN       TP Rate   FP Rate

 Naïve Bayes             718/43    6341/2808   94.35%    69.31%
 Multilayer Perceptron   328/433   108/9041    43.10%    1.18%
 C4.5                    572/189   69/9080     75.16%    0.75%
 Random Forest           554/207   68/9081     72.80%    0.74%
 Asymmetric Bagging      699/62    615/8534    91.86%    6.72%
Automatic detection of suspicious clones

    PACKAGE           EMBEDDED PACKAGE
    freevo            feedparser
    hedgewars         freetype
    ia32-libs         *
    libtk-img         tiff
    likewise-open     curl
    luatex            poppler
    planet-venus      feedparser
    syslinux          libpng
    vnc4              freetype
    vtk               tiff
Future Work
• Binary-level clone detection

• Integrate into Linux distributions

• Use by Linux security teams
Clonewise summary
• Practical clone detection in Linux

• Improves on manual-only tracking

• Has found bugs

• Debian wants to integrate it into its infrastructure

• Open source project

• Web service to perform clone detection
Detecting bugs in binaries using decompilation and data flow analysis
Motivation
• Detecting bugs in binaries is useful
 ▫   Black-box penetration testing
 ▫   External audits and compliance
 ▫   Quality assurance of 3rd party software
 ▫   Verification of compilation and linkage
Wire – A formal language for binary analysis
• x86 is complex and big

• Wire is a low-level, RISC-style assembly language

• Translated from x86

• Formally defined operational semantics



                  The LOAD instruction implements a memory read.
Stack Pointer Inference
• Proposed in HexRays decompiler -
  http://www.hexblog.com/?p=42

• Estimate Stack Pointer (SP) in and out of basic block
  ▫ By tracking and estimating SP modifications using linear
    inequalities
• Solve.

[Picture from the HexRays blog]
Decompilation - Local Variable Recovery
  • Based on stack pointer inference
  • Accesses to memory at an offset from the stack pointer
  • Replaced with a native Wire register

Before:
Imark     ($0x80483f5, , )
AddImm32  (%esp(4), $0x1c, %temp_memreg(12c))
LoadMem32 (%temp_memreg(12c), , %temp_op1d(66))
Imark     ($0x80483f9, , )
StoreMem32(%temp_op1d(66), , %esp(4))
Imark     ($0x80483fc, , )
SubImm32  (%esp(4), $0x4, %esp(4))
LoadImm32 ($0x80483fc, , %temp_op1d(66))
StoreMem32(%temp_op1d(66), , %esp(4))
Lcall     (, , $0x80482f0)

After:
Imark     ($0x80483f5, , )
Imark     ($0x80483f9, , )
Imark     ($0x80483fc, , )
Free      (%local_28(186bc), , )
Data Flow Analysis - Reaching Definitions
• A reaching definition is a definition of a variable
  that reaches a program point without being
  redefined.
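
[Figure: control flow graph. The entry block sets X=1 and Y=3 and branches on X>2 versus X<=2; the blocks that follow contain X=2 and Print(X) statements. Caption: Y=3, X=1, and X=2 are reaching definitions.]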
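
A compact sketch of the classic iterative reaching-definitions computation, run on a graph modelled on the X/Y example above. The exact edges of that example are an assumption (X=2 is placed on one branch, and both branches merge into a final block), and the block names are invented. This is the textbook algorithm, not Bugwise's implementation.

```python
# Each definition is (label, variable); blocks list their definitions in order.
blocks = {
    "entry": [("X=1", "X"), ("Y=3", "Y")],
    "then":  [("X=2", "X")],   # assumed: the branch that reassigns X
    "else":  [],
    "join":  [],               # final Print(X)
}
preds = {"entry": [], "then": ["entry"], "else": ["entry"],
         "join": ["then", "else"]}

def reaching_definitions(blocks, preds):
    all_defs = [d for defs in blocks.values() for d in defs]
    gen, kill = {}, {}
    for b, defs in blocks.items():
        last = {}
        for d in defs:
            last[d[1]] = d  # a later definition of the same variable wins
        gen[b] = set(last.values())
        kill[b] = {d for d in all_defs if d[1] in last} - gen[b]
    in_ = {b: set() for b in blocks}
    out = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b in blocks:
            in_[b] = set().union(*(out[p] for p in preds[b]))
            new_out = gen[b] | (in_[b] - kill[b])
            if new_out != out[b]:
                out[b], changed = new_out, True
    return in_, out

in_, out = reaching_definitions(blocks, preds)
print(sorted(label for label, _ in in_["join"]))
```

At the merge block all three definitions reach, matching the caption of the example; inside the branch that assigns X=2, the definition X=1 has been killed.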
More data flow problems
• Upward Exposed Uses
  ▫ All uses of a definition

• Live Variables
  ▫ A variable is live if it will be subsequently read
    without being redefined.

• Reaching Copies
  ▫ The reach of a copy statement

• etc
getenv() bugs
• Detect unsafe applications of getenv()
• Example: strcpy(buf,getenv("HOME"))
• For each getenv()
 ▫ If return value is live
 ▫ And it's the reaching definition to the 2nd
   argument to strcpy()
 ▫ Then warn

• P.S. 2001 wants its bugs back.
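
To make the rule concrete, here is a toy version over straight-line code only. The three-address tuple format is invented for this sketch and is not Wire; with no branches, the reaching definition of a variable is simply the last statement that assigned it.

```python
# Toy IR: (destination or None, function name, argument variables)
program = [
    ("t0", "getenv", ["name"]),
    (None, "strcpy", ["buf", "t0"]),
]

def unsafe_getenv_copies(program):
    """Return indexes of strcpy calls whose 2nd argument comes from getenv."""
    reaching = {}   # variable -> index of the statement that last defined it
    warnings = []
    for i, (dest, func, args) in enumerate(program):
        if func == "strcpy" and len(args) >= 2:
            src_def = reaching.get(args[1])
            if src_def is not None and program[src_def][1] == "getenv":
                warnings.append(i)
        if dest is not None:
            reaching[dest] = i
    return warnings

print(unsafe_getenv_copies(program))  # [1]
```

If another statement redefines `t0` between the two calls, the getenv() result no longer reaches strcpy() and no warning is produced.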
Use-after-free Detection
• For each free(ptr)
 ▫ If ptr live
 ▫ Then warn

void f(int x)
{
        int *p = malloc(10);
        dowork(p);
        free(p);
        if (x)
                p[0] = 1;
}
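
The check can be sketched with a textbook backward liveness analysis over a hand-built, statement-level graph of f(). The graph encoding and all names are assumptions made for this illustration; a real tool would derive them from the decompiled binary.

```python
# Statement-level CFG for f(): name -> (uses, defs, successors)
cfg = {
    "malloc": (set(), {"p"}, ["dowork"]),
    "dowork": ({"p"}, set(), ["free"]),
    "free":   ({"p"}, set(), ["if"]),
    "if":     ({"x"}, set(), ["store", "exit"]),
    "store":  ({"p"}, set(), ["exit"]),   # p[0] = 1
    "exit":   (set(), set(), []),
}

def live_out(cfg):
    """Backward data flow: live_out[n] is the union of live_in over successors."""
    live_in = {n: set() for n in cfg}
    out = {n: set() for n in cfg}
    changed = True
    while changed:
        changed = False
        for n, (uses, defs, succs) in cfg.items():
            new_out = set().union(*(live_in[s] for s in succs))
            new_in = uses | (new_out - defs)
            if new_out != out[n] or new_in != live_in[n]:
                out[n], live_in[n], changed = new_out, new_in, True
    return out

def use_after_free_warnings(cfg, frees):
    """frees: dict mapping a free() statement name -> the pointer it frees."""
    out = live_out(cfg)
    return [n for n, ptr in frees.items() if ptr in out[n]]

print(use_after_free_warnings(cfg, {"free": "p"}))  # ['free']
```

Because `p` is read again on the `if (x)` path, it is still live immediately after `free(p)`, so the call is flagged.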
Double Free Detection
• For each free(ptr)
 ▫ If an upward exposed use of ptr's definition is
   free(ptr)
 ▫ Then warn
• 2001 calls again

void f(int x)
{
        int *p = malloc(10);
        dowork(p);
        free(p);
        if (x)
                free(p);
}
getenv() bugs
•   Scanned entire Debian 7 unstable repository
•   ~123,000 ELF binaries
•   85 bug reports
•   47 packages

    4digits                    ptop
    acedb-other-belvu          recordmydesktop
    acedb-other-dotter         rlplot
    bvi                        sapphire
    comgt                      sc
    csmash                     scm
    elvis-tiny                 sgrep
    fvwm                       slurm-llnl-slurmdbd
    garmin-ant-downloader      statserial
    gcin                       stopmotion
    gexec                      supertransball2
    gmorgan                    theorur
    gopher                     twpsk
    gsoko                      udo
    gstm                       vnc4server
    hime                       wily
    le-dico-de-rene-cougnenc   wmpinboard
    libreoffice-dev            wmppp.app
    libxgks-dev                xboing
    lie                        xemacs21-bin
    lpe                        xjdic
    mp3rename                  xmotd
    mpich-mpd-bin
    open-cobol
    procmail
getenv() bugs over time – sorted by binary size
• Linear or power growth?
getenv() bug statistics
• Probability (P) of a binary being vulnerable: 0.00067

• P. of a package being vulnerable: 0.00255

   Conditional probability of A given that B has occurred:

   $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

• P. of a package having a 2nd vulnerability given that one
  binary in the package is vulnerable: 0.52380
Double free in SGID games “xonix”
  memset(score_rec[i].login, 0, 11);
  strncpy(score_rec[i].login, pw->pw_name, 10);
  memset(score_rec[i].full, 0, 65);
  strncpy(score_rec[i].full, fullname, 64);
  score_rec[i].tstamp = time(NULL);

  free(fullname);


  if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) {
      fprintf(stderr, "xonix: cannot reopen high score file\n");

      free(fullname);
      gameover_pending = 0;
      return;
  }
Future Work
• Core
 ▫   Summary-based interprocedural analysis
 ▫   Context sensitive interprocedural analysis
 ▫   Pointer analysis
 ▫   Improved decompilation
• More bug classes
Bugwise summary
• Practical tool to find simple bugs

• Based on strong theory

• Extensible

• Much work to do in the future

• Web service free to use
Future Work
• Make more of my research public

• Provide better backend infrastructure

• Get people to use the services!
Conclusion
• All of the tools in this talk are for public use

• http://www.FooCodeChu.com

  ▫ Wiki on software similarity and classification

  ▫ Preprint of my book available

• Buy my book from Springer

SAST, CWE, SEI CERT and other smart words from the information security world
 
Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)
 
BinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in HadoopBinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in Hadoop
 
Building Atlassian Plugins with Groovy - Atlassian Summit 2010 - Lightning Talks
Building Atlassian Plugins with Groovy - Atlassian Summit 2010 - Lightning TalksBuilding Atlassian Plugins with Groovy - Atlassian Summit 2010 - Lightning Talks
Building Atlassian Plugins with Groovy - Atlassian Summit 2010 - Lightning Talks
 
Search for Vulnerabilities Using Static Code Analysis
Search for Vulnerabilities Using Static Code AnalysisSearch for Vulnerabilities Using Static Code Analysis
Search for Vulnerabilities Using Static Code Analysis
 
SEppt
SEpptSEppt
SEppt
 
ANOTHER BRICK OFF THE WALL: DECONSTRUCTING WEB APPLICATION FIREWALLS USING AU...
ANOTHER BRICK OFF THE WALL: DECONSTRUCTING WEB APPLICATION FIREWALLS USING AU...ANOTHER BRICK OFF THE WALL: DECONSTRUCTING WEB APPLICATION FIREWALLS USING AU...
ANOTHER BRICK OFF THE WALL: DECONSTRUCTING WEB APPLICATION FIREWALLS USING AU...
 

More from Silvio Cesare

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGSilvio Cesare
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSSilvio Cesare
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Silvio Cesare
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisSilvio Cesare
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisSilvio Cesare
 
Effective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionEffective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionSilvio Cesare
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSilvio Cesare
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationSilvio Cesare
 
Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxSilvio Cesare
 
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Silvio Cesare
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSilvio Cesare
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareSilvio Cesare
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowSilvio Cesare
 
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...Silvio Cesare
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For EmulationSilvio Cesare
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource KernelsSilvio Cesare
 

More from Silvio Cesare (16)

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
 
Effective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detectionEffective flowgraph-based malware variant detection
Effective flowgraph-based malware variant detection
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web Service
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware Classification
 
Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in Linux
 
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux Distributions
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of Malware
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control Flow
 
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For Emulation
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
 

FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research

  • 1. FooCodeChu Services for software analysis, malware detection, and vulnerability research Silvio Cesare <silvio.cesare@gmail.com>
  • 2. Who am I and why this talk? • Ph.D. Student at Deakin University • Book Author • This talk covers some of my publicly accessible Ph.D. research.
  • 3. Introduction • Research on software analysis, similarity, and classification ▫ Malware detection and attribution ▫ Incident response ▫ Plagiarism detection ▫ Software theft detection ▫ Vulnerability research • Three academic research tools free to use on my website.
  • 4. Outline • Simseer • Clonewise • Bugwise • Future Work and Conclusion
  • 5. Software similarity and visualisation
  • 6. Motivation • Many applications of software similarity ▫ Malware detection ▫ Plagiarism detection ▫ Software theft detection • Traditional string signatures are ineffective • Modern fingerprints are effective but in many cases inefficient
  • 7. Program Representation
        lea     0x4(%esp),%ecx
        and     $0xfffffff0,%esp
        pushl   -0x4(%ecx)
        push    %ebp
        mov     %esp,%ebp
        push    %ecx
        sub     $0x24,%esp
        call    4011b0 <___main>
        movl    $0x0,-0x8(%ebp)
        jmp     40115f <_main+0x2f>

        movl    $0x4020a0,(%esp)
        call    4011b8 <_puts>
        addl    $0x1,-0x8(%ebp)

        cmpl    $0x9,-0x8(%ebp)
        jle     40114f <_main+0x1f>

        add     $0x24,%esp
        pop     %ecx
        pop     %ebp
        lea     -0x4(%ecx),%esp
        ret
    [Figure: graph of procedures Proc_0, Proc_1, Proc_2, Proc_3, and Proc_4]
  • 8. Simseer Program Fingerprint • Set of control flow graphs • Many procedures
  • 9.
  • 10. Decompilation of a Control Flow Graph
        proc(){
        L_0:
          while (v1 || v2) {
        L_1:
            if (v3) {
        L_2:
            } else {
        L_4:
            }
        L_5:
          }
        L_7:
          return;
        }
    Resulting signature string: W|IEH}R
    [Figure: control flow graph with nodes L_0, L_1, L_2, L_3, L_4, L_5, L_6, and L_7 joined by edges labelled "true"]
  • 11. Q-Grams • Input is the decompiled strings • Extract all possible fixed size substrings (q-grams) • Train 500 dominant q-grams • Example: the string W|IEH}R yields the q-grams W|IE, |IEH, IEH}, and EH}R
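    The extraction step on this slide can be sketched in a few lines of Python. This is only an illustration, not the Simseer implementation: the function name qgrams and the choice q = 4 (read off the slide's example) are assumptions.

    ```python
    def qgrams(s: str, q: int = 4) -> list[str]:
        """Return every contiguous substring of length q, in order of appearance."""
        # A string shorter than q yields an empty list.
        return [s[i:i + q] for i in range(len(s) - q + 1)]


    # The decompiled signature string from the slide.
    print(qgrams("W|IEH}R"))
    # → ['W|IE', '|IEH', 'IEH}', 'EH}R']
    ```

    A string of length n yields n - q + 1 q-grams, so the cost of extraction is linear in the length of the decompiled string.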
  • 12. Program Similarity • 500 q-grams make a "feature vector" • Similarity using vector distance
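    A minimal sketch of how q-gram counts could become a feature vector and a similarity score. The slide only says "vector distance"; the Manhattan distance, the normalisation to [0, 1], the four-entry vocabulary (standing in for the 500 trained q-grams), and all function names here are assumptions made for illustration.

    ```python
    from collections import Counter


    def qgrams(s, q=4):
        """All contiguous substrings of length q."""
        return [s[i:i + q] for i in range(len(s) - q + 1)]


    def feature_vector(signatures, vocabulary, q=4):
        """Count how often each vocabulary q-gram occurs across a program's signature strings."""
        counts = Counter(g for sig in signatures for g in qgrams(sig, q))
        return [counts[g] for g in vocabulary]


    def similarity(u, v):
        """Map the Manhattan distance between two count vectors to a score in [0, 1]."""
        total = sum(u) + sum(v)
        if total == 0:
            return 1.0  # two empty vectors are treated as identical
        distance = sum(abs(a - b) for a, b in zip(u, v))
        return 1.0 - distance / total


    # Hypothetical vocabulary; the talk trains 500 dominant q-grams instead.
    vocabulary = ["W|IE", "|IEH", "IEH}", "EH}R"]
    a = feature_vector(["W|IEH}R"], vocabulary)  # [1, 1, 1, 1]
    b = feature_vector(["W|IEH}"], vocabulary)   # [1, 1, 1, 0]
    print(similarity(a, b))
    ```

    Because every program maps to a vector of the same fixed length, comparing two programs costs time proportional to the vocabulary size rather than to the size of the programs.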
  • 13. Software similarity search [Figure: a query q with search radius r and the distance(p,q) to a program p; labels: Query, Benign, Malicious, Malware]
  • 14.
  • 15.
  • 16.
  • 17. Future Work • Give access to more classes of program "fingerprints" ▫ Call graphs ▫ Opcodes ▫ Different similarity measures
  • 18. Simseer summary • Simseer is effective • Efficient • Web service is free for public use
  • 19. Detecting package clones and inferring security problems
  • 20. Motivation • Developers may "embed" or "clone" software from 3rd party sources ▫ Maintaining an internal copy of a library ▫ Forking a library • Clonewise detects if two packages share code • And if one package is entirely embedded in another. [Figure labels: Firefox, libpng, Vulnerabilities]
  • 21. Feature Extraction – Shared package clone detection 1. N_Filenames_A 2. N_Filenames_Source_A 3. N_Filenames_B 4. N_Filenames_Source_B 5. N_Common_Filenames 6. N_Common_Similar_Filenames 7. N_Common_FilenameHashes 8. N_Common_FilenameHash80 9. N_Common_ExactFilenameHash 10. N_Score_of_Common_Filename 11. N_Score_of_Common_Similar_Filename 12. N_Score_of_Common_FilenameHash 13. N_Score_of_Common_FilenameHash80 14. N_Score_of_Common_ExactFilenameHash80 15. N_Data_Common_Filenames 16. N_Data_Common_Similar_Filenames 17. N_Data_Common_FilenameHashes 18. N_Data_Common_FilenameHash80 19. N_Data_Common_ExactFilenameHash 20. N_Data_Score_of_Common_Filename 21. N_Data_Score_of_Common_Similar_Filename 22. N_Data_Score_of_Common_FilenameHash 23. N_Data_Score_of_Common_FilenameHash80 24. N_Data_Score_of_Common_ExactFilenameHash80 25. N_Common_ExactHash 26. N_Common_DataExactHash
  • 22. Classification • Consider feature vectors as n-dimensional points in space. • Linear classifiers • Non-linear classifiers • Decision trees [Figure: points labelled Class A and Class B]
  • 23. Feature Extraction – Embedded clone detection 1. N_Filenames_A 2. N_Filenames_Source_A 3. N_Filenames_B 4. N_Filenames_Source_B 5. Percent_Match_In_A 6. Percent_Data_Match_In_A 7. Percent_Match_In_B 8. Percent_Data_Match_In_B 9. Percent_Score_In_A 10. Percent_Data_Score_In_A 11. Percent_Score_In_B 12. Percent_Data_Score_In_B 13. A_Has_Lib_In_Name 14. B_Has_Lib_In_Name 15. A_To_B_Ratio 16. A_To_B_Data_Ratio 17. N_Dependents_A 18. N_Dependents_B
  • 24. Detecting copyright violations 1. Identify embedded package clones. 2. Extract license information of each package. 3. For each GPL licensed embedded package clone: ▫ Verify that the package it is embedded in is not licensing it under a permissive license.
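    The three steps on this slide can be sketched as a simple filter over detected clone pairs. This is a hedged illustration, not Clonewise code: the function name, the input shapes, the package names, and the contents of the PERMISSIVE set are all hypothetical.

    ```python
    # Hypothetical set of licenses treated as permissive for this sketch.
    PERMISSIVE = {"MIT", "BSD", "Apache-2.0"}


    def license_conflicts(embedded_clones, licenses):
        """Flag GPL code embedded in a permissively licensed package.

        embedded_clones: iterable of (container, embedded) package name pairs (step 1).
        licenses: dict mapping package name to a license string (step 2).
        """
        conflicts = []
        for container, embedded in embedded_clones:
            embedded_is_gpl = licenses.get(embedded, "").startswith("GPL")
            container_is_permissive = licenses.get(container) in PERMISSIVE
            if embedded_is_gpl and container_is_permissive:  # step 3
                conflicts.append((container, embedded))
        return conflicts


    # Hypothetical packages and licenses.
    clones = [("app-a", "libfoo"), ("app-b", "libfoo"), ("app-a", "libbar")]
    licenses = {"app-a": "MIT", "app-b": "GPL-3", "libfoo": "GPL-2", "libbar": "BSD"}
    print(license_conflicts(clones, licenses))
    # → [('app-a', 'libfoo')]
    ```

    A flagged pair is only a candidate for manual review; real license strings are messier than the exact matches assumed here.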
  • 25. Automated Vulnerability Inference 1. Take CVE, match CPE name to Debian package. 2. Parse CVE summary and extract vuln filename. 3. Find clones of package with similar filename. 4. Trim dynamically linked clones. 5. Is vuln affected clone already being tracked?
  • 28. Shared package clone evaluation
      Classifier              TP/FN     FP/TN       TP Rate   FP Rate
      Naïve Bayes             439/322   484/56296   57.69%    0.85%
      Multilayer Perceptron   204/557   48/56732    26.81%    0.08%
      C4.5                    523/238   86/56694    68.73%    0.15%
      Random Forest           533/228   60/56720    70.04%    0.11%
      Random Forest (0.8)     446/315   15/56765    58.61%    0.03%
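    The rate columns in the evaluation above follow from the raw counts by the standard definitions TP Rate = TP / (TP + FN) and FP Rate = FP / (FP + TN). A short sketch (the function name rates is an assumption) that reproduces the Random Forest row:

    ```python
    def rates(tp, fn, fp, tn):
        """Return (true positive rate, false positive rate) from confusion-matrix counts."""
        tp_rate = tp / (tp + fn)
        fp_rate = fp / (fp + tn)
        return tp_rate, fp_rate


    # Counts from the Random Forest row: TP/FN = 533/228, FP/TN = 60/56720.
    tpr, fpr = rates(533, 228, 60, 56720)
    print(f"{tpr:.2%} {fpr:.2%}")
    # → 70.04% 0.11%
    ```

    With 56,780 negative pairs against 761 positive ones, even a small FP rate corresponds to a noticeable number of false alarms, which is why the FP column matters as much as the TP column here.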
  • 29. Embedded clone detection evaluation
      Classifier              TP/FN     FP/TN       TP Rate   FP Rate
      Naïve Bayes             718/43    6341/2808   94.35%    69.31%
      Multilayer Perceptron   328/433   108/9041    43.10%    1.18%
      C4.5                    572/189   69/9080     75.16%    0.75%
      Random Forest           554/207   68/9081     72.80%    0.74%
      Asymmetric Bagging      699/62    615/8534    91.86%    6.72%
  • 30. Automatic detection of suspicious clones
      PACKAGE         EMBEDDED PACKAGE
      freevo          feedparser
      hedgewars       freetype
      ia32-libs       *
      libtk-img       tiff
      likewise-open   curl
      luatex          poppler
      planet-venus    feedparser
      syslinux        libpng
      vnc4            freetype
      vtk             tiff
  • 31.
  • 32. Future Work • Binary-level clone detection • Integrate into Linux distributions • Use by Linux security teams
  • 33. Clonewise summary • Practical clone detection in Linux • Improves on manual-only tracking • Has found bugs • Debian wants to integrate it into its infrastructure • Open source project • Web service to perform clone detection
  • 34. Detecting bugs in binaries using decompilation and data flow analysis
  • 35. Motivation • Detecting bugs in binaries is useful ▫ Black-box penetration testing ▫ External audits and compliance ▫ Quality assurance of 3rd party software ▫ Verification of compilation and linkage
  • 36. Wire – A formal language for binary analysis • x86 is complex and big • Wire is a low level RISC assembly style language • Translated from x86 • Formally defined operational semantics The LOAD instruction implements a memory read.
  • 37. Stack Pointer Inference • Proposed in HexRays decompiler - http://www.hexblog.com/?p=42 • Estimate Stack Pointer (SP) in and out of basic block ▫ By tracking and estimating SP modifications using linear inequalities • Solve. [Figure: picture from the HexRays blog]
  • 38. Decompilation - Local Variable Recovery • Based on stack pointer inference • Access to memory offset to the stack • Replace with native Wire register
    Before:
        Imark ($0x80483f5, , )
        AddImm32 (%esp(4), $0x1c, %temp_memreg(12c))
        LoadMem32 (%temp_memreg(12c), , %temp_op1d(66))
        Imark ($0x80483f9, , )
        StoreMem32(%temp_op1d(66), , %esp(4))
        Imark ($0x80483fc, , )
        SubImm32 (%esp(4), $0x4, %esp(4))
        LoadImm32 ($0x80483fc, , %temp_op1d(66))
        StoreMem32(%temp_op1d(66), , %esp(4))
        Lcall (, , $0x80482f0)
    After:
        Imark ($0x80483f5, , )
        Imark ($0x80483f9, , )
        Imark ($0x80483fc, , )
        Free (%local_28(186bc), , )
  • 39. Data Flow Analysis - Reaching Definitions • A reaching definition is a definition of a variable that reaches a program point without being redefined. [Figure: control flow graph with the nodes X=1, Y=3, a branch on X>2 versus X<=2, X=2, and Print(X)] • Y=3, X=1, and X=2 are reaching definitions
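    Reaching definitions are usually computed with an iterative forward data flow analysis over gen and kill sets. The sketch below is a textbook-style formulation, not Bugwise's implementation; the block names, the definition labels d1 to d3, and the exact shape of the example graph (X=2 assigned on only one branch) are assumptions loosely modelled on the slide.

    ```python
    def reaching_definitions(blocks, preds):
        """Compute the definitions reaching the entry of each basic block.

        blocks: dict block name -> list of (definition id, variable) in execution order.
        preds:  dict block name -> list of predecessor block names.
        """
        var_of = {d: v for defs in blocks.values() for d, v in defs}
        gen, kill = {}, {}
        for name, defs in blocks.items():
            last = {}
            for d, v in defs:
                last[v] = d  # only the last definition of v survives the block
            gen[name] = set(last.values())
            # A block kills every other definition of the variables it assigns.
            kill[name] = {d for d, v in var_of.items() if v in last} - gen[name]

        in_sets = {n: set() for n in blocks}
        out_sets = {n: set(gen[n]) for n in blocks}
        changed = True
        while changed:  # iterate to a fixed point
            changed = False
            for n in blocks:
                new_in = set().union(*(out_sets[p] for p in preds.get(n, [])))
                new_out = gen[n] | (new_in - kill[n])
                if new_in != in_sets[n] or new_out != out_sets[n]:
                    in_sets[n], out_sets[n] = new_in, new_out
                    changed = True
        return in_sets


    # d1: X=1, d2: Y=3 in B0; d3: X=2 on one branch (B1); B3 holds Print(X).
    blocks = {"B0": [("d1", "X"), ("d2", "Y")], "B1": [("d3", "X")], "B2": [], "B3": []}
    preds = {"B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
    print(sorted(reaching_definitions(blocks, preds)["B3"]))
    # → ['d1', 'd2', 'd3']
    ```

    In this example X=1 reaches the final Print(X) only along the branch that does not reassign X, which is why both X=1 and X=2 appear in the result.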
  • 40. More data flow problems • Upward Exposed Uses ▫ All uses of a definition • Live Variables ▫ A variable is live if it will be subsequently read without being redefined. • Reaching Copies ▫ The reach of a copy statement • etc
  • 41. getenv() bugs • Detect unsafe applications of getenv() • Example: strcpy(buf, getenv("HOME")) • For each getenv() ▫ If return value is live ▫ And it's the reaching definition to the 2nd argument to strcpy() ▫ Then warn • P.S. 2001 wants its bugs back.
  • 42. Use-after-free Detection • For each free(ptr) ▫ If ptr is live after the call ▫ Then warn • Example: void f(int x) { int *p = malloc(10); dowork(p); free(p); if (x) p[0] = 1; }
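    The rule on this slide can be sketched with a small backward liveness analysis: a pointer is live after free(ptr) if some later statement may read it before it is redefined. This is an illustrative model, not Bugwise's code; the statement encoding (use, def, succ, free) and the function names are assumptions, and the example graph hand-translates the slide's f(int x).

    ```python
    def live_out(nodes):
        """Backward liveness over single-statement nodes.

        nodes: dict id -> {"use": set, "def": set, "succ": list of ids}.
        Returns dict id -> set of variables live immediately after the statement.
        """
        live_in = {n: set() for n in nodes}
        out = {n: set() for n in nodes}
        changed = True
        while changed:  # iterate to a fixed point
            changed = False
            for n, s in nodes.items():
                new_out = set().union(*(live_in[m] for m in s["succ"]))
                new_in = s["use"] | (new_out - s["def"])
                if new_out != out[n] or new_in != live_in[n]:
                    out[n], live_in[n] = new_out, new_in
                    changed = True
        return out


    def use_after_free_warnings(nodes):
        """Return the ids of free(ptr) statements after which ptr is still live."""
        out = live_out(nodes)
        return [n for n, s in nodes.items() if s.get("free") in out[n]]


    # void f(int x) { int *p = malloc(10); dowork(p); free(p); if (x) p[0] = 1; }
    f = {
        1: {"use": set(), "def": {"p"}, "succ": [2]},             # p = malloc(10)
        2: {"use": {"p"}, "def": set(), "succ": [3]},             # dowork(p)
        3: {"use": {"p"}, "def": set(), "succ": [4], "free": "p"},  # free(p)
        4: {"use": {"x"}, "def": set(), "succ": [5, 6]},          # if (x)
        5: {"use": {"p"}, "def": set(), "succ": [6]},             # p[0] = 1 reads p
        6: {"use": set(), "def": set(), "succ": []},              # return
    }
    print(use_after_free_warnings(f))
    # → [3]
    ```

    The analysis is path-insensitive: it warns because p may be read on the branch where x is non-zero, without reasoning about whether that branch is feasible.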
  • 43. Double Free Detection • For each free(ptr) ▫ If an upward exposed use of ptr's definition is free(ptr) ▫ Then warn • 2001 calls again • Example: void f(int x) { int *p = malloc(10); dowork(p); free(p); if (x) free(p); }
  • 44. getenv() bugs • Scanned entire Debian 7 unstable repository • ~123,000 ELF binaries • 85 bug reports • 47 packages: 4digits, acedb-other-belvu, acedb-other-dotter, bvi, comgt, csmash, elvis-tiny, fvwm, garmin-ant-downloader, gcin, gexec, gmorgan, gopher, gsoko, gstm, hime, le-dico-de-rene-cougnenc, libreoffice-dev, libxgks-dev, lie, lpe, mp3rename, mpich-mpd-bin, open-cobol, procmail, ptop, recordmydesktop, rlplot, sapphire, sc, scm, sgrep, slurm-llnl-slurmdbd, statserial, stopmotion, supertransball2, theorur, twpsk, udo, vnc4server, wily, wmpinboard, wmppp.app, xboing, xemacs21-bin, xjdic, xmotd
  • 45. getenv() bugs over time – sorted by binary size • Linear or power growth?
  • 46. getenv() bug statistics • Probability (P) of a binary being vulnerable: 0.00067 • Probability of a package being vulnerable: 0.00255 • Conditional probability of A given that B has occurred: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ • Probability of a package having a 2nd vulnerability given that one binary in the package is vulnerable: 0.52380
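    The conditional probability on this slide can be estimated from counts as the number of packages in both A and B divided by the number of packages in B. The sketch below uses made-up per-package counts purely to show the arithmetic; it does not reproduce the talk's data or its 0.52380 figure, and the names are assumptions.

    ```python
    def conditional_probability(n_a_and_b, n_b):
        """Estimate P(A | B) from counts as |A and B| / |B|."""
        if n_b == 0:
            raise ValueError("B never occurred, so P(A | B) is undefined")
        return n_a_and_b / n_b


    # Hypothetical number of vulnerable binaries per package (not the talk's data).
    vulnerable_binaries = {"pkg-a": 1, "pkg-b": 3, "pkg-c": 2, "pkg-d": 1}

    # B: the package has at least one vulnerable binary.
    at_least_one = sum(1 for n in vulnerable_binaries.values() if n >= 1)
    # A and B: the package has a second vulnerable binary as well.
    at_least_two = sum(1 for n in vulnerable_binaries.values() if n >= 2)

    p = conditional_probability(at_least_two, at_least_one)
    print(p)
    # → 0.5
    ```

    The point of the slide's comparison is that this conditional probability is far higher than the unconditional probability of a package being vulnerable, so one finding in a package is a strong hint to look for more.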
  • 47.
  • 48. Double free in SGID games "xonix"
        memset(score_rec[i].login, 0, 11);
        strncpy(score_rec[i].login, pw->pw_name, 10);
        memset(score_rec[i].full, 0, 65);
        strncpy(score_rec[i].full, fullname, 64);
        score_rec[i].tstamp = time(NULL);
        free(fullname);                     /* first free */
        if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) {
          fprintf(stderr, "xonix: cannot reopen high score file\n");
          free(fullname);                   /* second free of the same pointer */
          gameover_pending = 0;
          return;
        }
  • 49. Future Work • Core ▫ Summary-based interprocedural analysis ▫ Context sensitive interprocedural analysis ▫ Pointer analysis ▫ Improved decompilation • More bug classes
  • 50. Bugwise summary • Practical tool to find simple bugs • Based on strong theory • Extensible • Much work to do in the future • Web service free to use
  • 51.
  • 52. Future Work • Make more of my research public • Provide better backend infrastructure • Get people to use the services!
  • 53. Conclusion • All of the tools in this talk are for public use • http://www.FooCodeChu.com ▫ Wiki on software similarity and classification ▫ Preprint of my book available • Buy my book from Springer