SlideShare a Scribd company logo
EFFECTIVE FLOWGRAPH-
BASED MALWARE VARIANT
DETECTION
Silvio Cesare, Ph.D. Candidate, Deakin University
http://www.foocodechu.com
silvio.cesare@gmail.com
WHO AM I AND WHERE DID THIS TALK
COME FROM?
   Ph.D. Student at Deakin University.

   Research interests include:
     Automated vulnerability discovery.
     Software similarity and classification.
     Malware detection.


   This presentation is based on my malware
    research.
OUTLINE
1.   Introduction (you might already know this)

2.   New approaches to flowgraph-based
     classification

3.   Evaluation

4.   Other things we use our system on.

5.   Conclusion
INTRODUCTION
This is to make sure everyone is up to speed.
If you’ve been to my presentations before, you
might have already seen it.
INTRODUCTION
   Malware a significant problem.
   Static detection of malware a dominant real-time
    technique.
   Detecting unknown variants from known
    samples very useful.                     Roron.ao
                                Klez.a       Roron.b
                                Klez.b       Roron.d
                                Klez.c       Roron.e
                                Klez.d       Roron.f
                                ...          ...
SIGNATURES AND BIRTHMARKS
   A birthmark is an invariant property in related
    samples.

   Birthmark comparison should allow inexact
    matching.
LIMITATIONS OF EXISTING
BIRTHMARKS
   Byte-level content can change in every variant.

   Comparing birthmarks often exact matching
    only.

   Inefficient for inexact database searching.

   Unable to detect unknown variants of known
    samples.

   Program structure a better birthmark.
THE SOFTWARE SIMILARITY
PROBLEM
THE SOFTWARE SIMILARITY
SEARCH
 Need a dissimilarity or distance metric.
 “Metric” property allows efficient database
  search.                     Query Benign

                                         r
                                   q
                          d(p,q)
              p
                                        Query Malicious
                  Query

                  Malware
EXISTING APPROACHES: A CALL GRAPH
BIRTHMARK
   Inter-procedureal control flow.
AN OPTIMAL DISSIMILARITY METRIC
FOR GRAPHS
   Graph edit distance.

   Number of operations to transform one graph to
    another.

   Complexity in NP.

   Non optimal solutions possible in cubic time.
OUR APPROACH: A SET OF CONTROL
FLOW GRAPHS BIRTHMARK
 Intra-procedural control flow.
 Many procedures.
TRANSFORMING GRAPH DISSIMILARITY
TO A STRING DISSIMILARITY PROBLEM
 Decompile control flow graphs to strings.
 Compare strings using ‘string metrics’.
                                           proc (){
                             L_0                                  W|IEH}R
                                           L_0:
                                             while (v1 || v2) {
                             L_3           L_1:
                                               if (v3) {
               true                        L_2:
                             L_6
                                               } else {
                      true                 L_4:
                                               }
               L_1           L_7           L_5:
                                    true     }
               true                        L_7:
                                             return ;
               L_2           L_4
                                           }
                             true

                             L_5
NEW APPROACHES TO
FLOWGRAPH-BASED
CLASSIFICATION
TRANSFORMING A SET OF STRINGS
PROBLEM INTO A STRING PROBLEM
   Decompiled CFGs give us a set of strings.
                                             R    W|IEH}R



   Order and concatenate strings.      W|IEH}R     R


                                                            W|IEH}RZ
                                                            RZ
                                         W|}R      W|}R     W|}RZ

   Deliminate substrings with ‘Z’.
                                                            IEHRZ
                                                            SRZ

                                         IEHR      IEHR




   Order based on metrics.               SR        SR




     Number of instructions in procedure.
     Number of basic blocks.
     etc
WHAT WE TRIED (AND ENDED UP NOT
USING)
   String metrics:

       Edit distance    ed(“hello”, “ggello”) = 2

                                                                   C ( xy ) −min{C ( x), (C , y )}
       Normalized Compression Distance          NCD ( x, y ) =
                                                                        max{C ( x), C ( y )}


                                   A K TKT K
       Sequence alignment        | | | | |
                                   ATKTT T K
   All databases indexed using metric trees.
SEQUENCE ALIGNMENT WITH
BLAST
   A heuristic genome sequence search tool.

   Local sequence alignment.

   Hugely popular in bioinformatics.

   So.. transform our strings into genome
    sequences.

   Then, do a genome search.
GENOME SEQUENCE EXTRACTION
                               proc (){
                 L_0                                  W|IEH}R
                               L_0:
                                 while (v1 || v2) {
                 L_3           L_1:
                                   if (v3) {
   true                        L_2:
                 L_6
                                   } else {
          true                 L_4:
                                   }




                                                                 ACGTRYKMACGTRYKM
  L_1            L_7           L_5:
                        true     }
  true                         L_7:
                                 return ;
  L_2            L_4
                               }
                 true

                 L_5




                 L_0
                               proc (){
                               L_0:
                                                      W|IEH}R
                                                                    A =   Adeline
                                 while (v1 || v2) {



   true

          true
                 L_3

                 L_6
                               L_1:

                               L_2:

                               L_4:
                                   if (v3) {

                                   } else {
                                                                    C =   Cytosine
                                                                    G =   Guanine
                                   }
  L_1            L_7           L_5:
                        true     }
  true                         L_7:
                                 return ;
  L_2            L_4
                               }



                                                                    T =   Thyamine
                 true

                 L_5




                                                                    ...
WHY DIDN’T WE USE THOSE
APPROACHES?
   Not optimally effective.

   Too slow.

   Best speed was using NCD.
A DISSIMILARITY METRIC FOR SETS OF
STRINGS (WHAT WE ENDED UP USING)
   Find a mapping between strings to minimize the
    sum of distances.
                                  p
                                      d=ed(p,q)
                                                  q
         BR

         BW|{B}BR
                    BR
         BI{B}BR
                    BW|{B}BR
         BSSR
                    BSSR
         BSR
         BSSSR
COMBINATORIAL OPTIMISATION: THE
ASSIGNMENT PROBLEM
   Finding a minimum cost mapping is known as
    the “assignment problem”

   Optimal solutions exist in cubic time.

   “Greedy” heuristic solutions faster.

   Has the properties of a metric.
EVALUATION
IMPLEMENTATION
   Malwise system is 100,000 lines of code of C++.

   The modules for this work < 3000 lines of code.

   Unpacks malware using an application level
    emulator (Ruxcon 2010)

   Pre-filtering stage to quickly cull non matching
    variants (Ruxcon 2011)
EVALUATION - EFFECTIVENESS
   Calculated similarity between Roron malware
    variants.

   Compared results to Ruxcon 2010 work.

   In tables, highlighted cells indicates a positive
    match.

   The more matches the more effective it is.
EVALUATION - EFFECTIVENESS
        ao       b       d       e       g        k       m         q         a            ao       b      d      e      g      k     m       q      a
  ao          0.44    0.28    0.27    0.28     0.55     0.44     0.44      0.47      ao          0.70   0.28   0.28   0.27   0.75   0.70   0.70   0.75
  b    0.44           0.27    0.27    0.27     0.51     1.00     1.00      0.58      b    0.74          0.31   0.34   0.33   0.82   1.00   1.00   0.87
  d    0.28   0.27            0.48    0.56     0.27     0.27     0.27      0.27      d    0.28   0.29          0.50   0.74   0.29   0.29   0.29   0.29
  e    0.27   0.27    0.48            0.59     0.27     0.27     0.27      0.27      e    0.31   0.34   0.50          0.64   0.32   0.34   0.34   0.33
  g    0.28   0.27    0.56    0.59             0.27     0.27     0.27      0.27      g    0.27   0.33   0.74   0.64          0.29   0.33   0.33   0.30
  k    0.55   0.51    0.27    0.27    0.27              0.51     0.51      0.75      k    0.75   0.82   0.29   0.30   0.29          0.82   0.82   0.96
  m    0.44   1.00    0.27    0.27    0.27     0.51              1.00      0.58      m    0.74   1.00   0.31   0.34   0.33   0.82          1.00   0.87
  q    0.44   1.00    0.27    0.27    0.27     0.51     1.00               0.58      q    0.74   1.00   0.31   0.34   0.33   0.82   1.00          0.87
  a    0.47   0.58    0.27    0.27    0.27     0.75     0.58     0.58                a    0.75   0.87   0.30   0.31   0.30   0.96   0.87   0.87



                Exact Matching                                                       Heuristic Approximate
                (Ruxcon 2010)                                                        Matching (Ruxcon 2010)
         ao       b       d       e        g        k       m          q         a
  ao           0.86    0.49    0.54     0.50     0.87     0.86      0.86      0.86
  b    0.87            0.57    0.63     0.62     0.96     1.00      1.00      0.96
  d    0.61    0.64            0.85     0.91     0.64     0.64      0.64      0.64
  e    0.64    0.69    0.85             0.90     0.68     0.69      0.69      0.68
  g    0.62    0.68    0.91    0.91              0.68     0.68      0.68      0.68
  k    0.88    0.96    0.58    0.62     0.61              0.96      0.96      0.99
  m    0.87    1.00    0.57    0.63     0.62     0.96               1.00      0.96
  q    0.87    1.00    0.57    0.63     0.62     0.96     1.00                0.96
  a    0.87    0.96    0.58    0.62     0.61     0.99     0.96      0.96



               Assignment problem
EVALUATION – FALSE POSITIVES
   Database of 10,000 malware.

   Scanned 1,601 benign binaries.

   7 false positives. Less than 1%.

   Very small binaries have small signatures and
    cause weak matching.
EVALUATION - EFFICIENCY
   Median benign and malware processing time is
    0.06s and 0.84s.
                                                       Malware
                        % Samples     Benign Time(s)
                                                       Time(s)
                                 10             0.02         0.16
                                 20             0.02         0.28
                                 30             0.03         0.30
                                 40             0.03         0.36
                                 50             0.06         0.84
                                 60             0.09         0.94
                                 70             0.13         0.97
                                 80             0.25         1.03
                                 90             0.56         1.31
                                100             8.06       585.16
BUT THAT’S NOT ALL WE
USE THE MALWISE ENGINE
FOR..
SIMSEER – A SOFTWARE SIMILARITY
WEB SERVICE
   An online service to identify similarity between
    programs
   Based on Malwise.
   Renders an evolutionary tree to show program
    relationships.
   Free to use!
   http://www.foocodechu.com/?q=simseer-a-software-similarit
SIMSEER - DEMO
   http://www.youtube.com/watch?v=ymo7DKlKCH4
BUGWISE
   Automatically detect bugs and vulnerabilities in
    Linux executable binaries.
   Uses static program analysis from Malwise.
     Decompilation
     Data Flow Analysis 

   Free to use!
   http://www.foocodechu.com/?q=bugwise-a-bug-detection-we
BUGWISE – SGID GAMES XONIX BUG IN
DEBIAN LINUX
  memset(score_rec[i].login, 0, 11);

  strncpy(score_rec[i].login, pw->pw_name, 10);

  memset(score_rec[i].full, 0, 65);

  strncpy(score_rec[i].full, fullname, 64);

  score_rec[i].tstamp = time(NULL);

  free(fullname);


  if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) {

      fprintf(stderr, "xonix: cannot reopen high score filen");

      free(fullname);
      gameover_pending = 0;

      return;

  }
PUBLICATIONS
 Book published by Springer.
 http://www.springer.com/computer/security+and+
  cryptology/book/978-1-4471-2908-0
CONCLUSION
   Malwise effectively identifies malware variants.

   Runs in real-time in expected case.

   Large functional code base and years of
    development time.

   Happy to talk to vendors.

   http://www.FooCodeChu.com

More Related Content

What's hot

Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional Programming
Mario Fusco
 
Latest C Interview Questions and Answers
Latest C Interview Questions and AnswersLatest C Interview Questions and Answers
Latest C Interview Questions and Answers
DaisyWatson5
 

What's hot (10)

Why we cannot ignore Functional Programming
Why we cannot ignore Functional ProgrammingWhy we cannot ignore Functional Programming
Why we cannot ignore Functional Programming
 
pointer, virtual function and polymorphism
pointer, virtual function and polymorphismpointer, virtual function and polymorphism
pointer, virtual function and polymorphism
 
A/F/C-orientation
A/F/C-orientationA/F/C-orientation
A/F/C-orientation
 
Do we need a logic of quantum computation?
Do we need a logic of quantum computation?Do we need a logic of quantum computation?
Do we need a logic of quantum computation?
 
automata problems
automata problemsautomata problems
automata problems
 
05 dataflow
05 dataflow05 dataflow
05 dataflow
 
Lazy java
Lazy javaLazy java
Lazy java
 
Latest C Interview Questions and Answers
Latest C Interview Questions and AnswersLatest C Interview Questions and Answers
Latest C Interview Questions and Answers
 
Seminar on quantum automata (complete)
Seminar on quantum automata (complete)Seminar on quantum automata (complete)
Seminar on quantum automata (complete)
 
Components - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API LimitationsComponents - Graph Based Detection of Library API Limitations
Components - Graph Based Detection of Library API Limitations
 

Viewers also liked (7)

Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
Malware Variant Detection Using Similarity Search over Sets of Control Flow G...
 
Faster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware ClassificationFaster, More Effective Flowgraph-based Malware Classification
Faster, More Effective Flowgraph-based Malware Classification
 
Statistical flowgraph models
Statistical flowgraph modelsStatistical flowgraph models
Statistical flowgraph models
 
Software Testing
Software TestingSoftware Testing
Software Testing
 
Software testing methods, levels and types
Software testing methods, levels and typesSoftware testing methods, levels and types
Software testing methods, levels and types
 
Software Testing Basics
Software Testing BasicsSoftware Testing Basics
Software Testing Basics
 
Software Testing Fundamentals
Software Testing FundamentalsSoftware Testing Fundamentals
Software Testing Fundamentals
 

Similar to Effective flowgraph-based malware variant detection

Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural Networks
ESCOM
 
Simseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySimseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made Easy
Silvio Cesare
 
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
Silvio Cesare
 
Dot Call interface
Dot Call interfaceDot Call interface
Dot Call interface
Hao Chai
 
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docxCSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
mydrynan
 
Ekeko Technology Showdown at SoTeSoLa 2012
Ekeko Technology Showdown at SoTeSoLa 2012Ekeko Technology Showdown at SoTeSoLa 2012
Ekeko Technology Showdown at SoTeSoLa 2012
Coen De Roover
 
Coffee script
Coffee scriptCoffee script
Coffee script
timourian
 

Similar to Effective flowgraph-based malware variant detection (20)

A Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisA Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer Analysis
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
 
Lecture20 vector
Lecture20 vectorLecture20 vector
Lecture20 vector
 
Multilayer Neural Networks
Multilayer Neural NetworksMultilayer Neural Networks
Multilayer Neural Networks
 
Seductions of Scala
Seductions of ScalaSeductions of Scala
Seductions of Scala
 
COMPILER_DESIGN_CLASS 2.ppt
COMPILER_DESIGN_CLASS 2.pptCOMPILER_DESIGN_CLASS 2.ppt
COMPILER_DESIGN_CLASS 2.ppt
 
COMPILER_DESIGN_CLASS 1.pptx
COMPILER_DESIGN_CLASS 1.pptxCOMPILER_DESIGN_CLASS 1.pptx
COMPILER_DESIGN_CLASS 1.pptx
 
Simseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made EasySimseer.com - Malware Similarity and Clustering Made Easy
Simseer.com - Malware Similarity and Clustering Made Easy
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
 
(chapter 4) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 4) A Concise and Practical Introduction to Programming Algorithms in...(chapter 4) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 4) A Concise and Practical Introduction to Programming Algorithms in...
 
Lambda Calculus
Lambda CalculusLambda Calculus
Lambda Calculus
 
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerabi...
 
Dot Call interface
Dot Call interfaceDot Call interface
Dot Call interface
 
Relational Algebra and Calculus.ppt
Relational Algebra and Calculus.pptRelational Algebra and Calculus.ppt
Relational Algebra and Calculus.ppt
 
1-11.pdf
1-11.pdf1-11.pdf
1-11.pdf
 
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docxCSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
CSCI 2033 Elementary Computational Linear Algebra(Spring 20.docx
 
First approach in linq
First approach in linqFirst approach in linq
First approach in linq
 
Illicium: Compiling Pharo to C
Illicium: Compiling Pharo to CIllicium: Compiling Pharo to C
Illicium: Compiling Pharo to C
 
Ekeko Technology Showdown at SoTeSoLa 2012
Ekeko Technology Showdown at SoTeSoLa 2012Ekeko Technology Showdown at SoTeSoLa 2012
Ekeko Technology Showdown at SoTeSoLa 2012
 
Coffee script
Coffee scriptCoffee script
Coffee script
 

More from Silvio Cesare

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
Silvio Cesare
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Silvio Cesare
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Silvio Cesare
 
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...
Silvio Cesare
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
Silvio Cesare
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web Service
Silvio Cesare
 
Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in Linux
Silvio Cesare
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux Distributions
Silvio Cesare
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of Malware
Silvio Cesare
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For Emulation
Silvio Cesare
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
Silvio Cesare
 

More from Silvio Cesare (14)

A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKINGA BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
A BEGINNER’S JOURNEY INTO THE WORLD OF HARDWARE HACKING
 
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERSA WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
 
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
Simseer and Bugwise - Web Services for Binary-level Software Similarity and D...
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
 
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...Clonewise  - Automatically Detecting Package Clones and Inferring Security Vu...
Clonewise - Automatically Detecting Package Clones and Inferring Security Vu...
 
Wire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary AnalysisWire - A Formal Intermediate Language for Binary Analysis
Wire - A Formal Intermediate Language for Binary Analysis
 
Simseer - A Software Similarity Web Service
Simseer - A Software Similarity Web ServiceSimseer - A Software Similarity Web Service
Simseer - A Software Similarity Web Service
 
Automated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in LinuxAutomated Detection of Software Bugs and Vulnerabilities in Linux
Automated Detection of Software Bugs and Vulnerabilities in Linux
 
Simple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux DistributionsSimple Bugs and Vulnerabilities in Linux Distributions
Simple Bugs and Vulnerabilities in Linux Distributions
 
Fast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of MalwareFast Automated Unpacking and Classification of Malware
Fast Automated Unpacking and Classification of Malware
 
Malware Classification Using Structured Control Flow
Malware Classification Using Structured Control FlowMalware Classification Using Structured Control Flow
Malware Classification Using Structured Control Flow
 
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
A Fast Flowgraph Based Classification System for Packed and Polymorphic Malwa...
 
Security Applications For Emulation
Security Applications For EmulationSecurity Applications For Emulation
Security Applications For Emulation
 
Auditing the Opensource Kernels
Auditing the Opensource KernelsAuditing the Opensource Kernels
Auditing the Opensource Kernels
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Effective flowgraph-based malware variant detection

  • 1. EFFECTIVE FLOWGRAPH- BASED MALWARE VARIANT DETECTION Silvio Cesare, Ph.D. Candidate, Deakin University http://www.foocodechu.com silvio.cesare@gmail.com
  • 2. WHO AM I AND WHERE DID THIS TALK COME FROM?  Ph.D. Student at Deakin University.  Research interests include:  Automated vulnerability discovery.  Software similarity and classification.  Malware detection.  This presentation is based on my malware research.
  • 3. OUTLINE 1. Introduction (you might already know this) 2. New approaches to flowgraph-based classification 3. Evaluation 4. Other things we use our system on. 5. Conclusion
  • 4. INTRODUCTION This is to make sure everyone is up to speed. If you’ve been to my presentations before, you might have already seen it.
  • 5. INTRODUCTION  Malware a significant problem.  Static detection of malware a dominant real-time technique.  Detecting unknown variants from known samples very useful. Roron.ao Klez.a Roron.b Klez.b Roron.d Klez.c Roron.e Klez.d Roron.f ... ...
  • 6. SIGNATURES AND BIRTHMARKS  A birthmark is an invariant property in related samples.  Birthmark comparison should allow inexact matching.
  • 7. LIMITATIONS OF EXISTING BIRTHMARKS  Byte-level content can change in every variant.  Comparing birthmarks often exact matching only.  Inefficient for inexact database searching.  Unable to detect unknown variants of known samples.  Program structure a better birthmark.
  • 9. THE SOFTWARE SIMILARITY SEARCH  Need a dissimilarity or distance metric.  “Metric” property allows efficient database search. Query Benign r q d(p,q) p Query Malicious Query Malware
  • 10. EXISTING APPROACHES: A CALL GRAPH BIRTHMARK  Inter-procedureal control flow.
  • 11. AN OPTIMAL DISSIMILARITY METRIC FOR GRAPHS  Graph edit distance.  Number of operations to transform one graph to another.  Complexity in NP.  Non optimal solutions possible in cubic time.
  • 12. OUR APPROACH: A SET OF CONTROL FLOW GRAPHS BIRTHMARK  Intra-procedural control flow.  Many procedures.
  • 13. TRANSFORMING GRAPH DISSIMILARITY TO A STRING DISSIMILARITY PROBLEM  Decompile control flow graphs to strings.  Compare strings using ‘string metrics’. proc (){ L_0 W|IEH}R L_0: while (v1 || v2) { L_3 L_1: if (v3) { true L_2: L_6 } else { true L_4: } L_1 L_7 L_5: true } true L_7: return ; L_2 L_4 } true L_5
  • 15. TRANSFORMING A SET OF STRINGS PROBLEM INTO A STRING PROBLEM  Decompiled CFGs give us a set of strings. R W|IEH}R  Order and concatenate strings. W|IEH}R R W|IEH}RZ RZ W|}R W|}R W|}RZ  Deliminate substrings with ‘Z’. IEHRZ SRZ IEHR IEHR  Order based on metrics. SR SR  Number of instructions in procedure.  Number of basic blocks.  etc
  • 16. WHAT WE TRIED (AND ENDED UP NOT USING)  String metrics:  Edit distance  ed(“hello”, “ggello”) = 2 C ( xy ) −min{C ( x), (C , y )}  Normalized Compression Distance  NCD ( x, y ) = max{C ( x), C ( y )} A K TKT K  Sequence alignment  | | | | | ATKTT T K  All databases indexed using metric trees.
  • 17. SEQUENCE ALIGNMENT WITH BLAST  A heuristic genome sequence search tool.  Local sequence alignment.  Hugely popular in bioinformatics.  So.. transform our strings into genome sequences.  Then, do a genome search.
  • 18. GENOME SEQUENCE EXTRACTION proc (){ L_0 W|IEH}R L_0: while (v1 || v2) { L_3 L_1: if (v3) { true L_2: L_6 } else { true L_4: }  ACGTRYKMACGTRYKM L_1 L_7 L_5: true } true L_7: return ; L_2 L_4 } true L_5 L_0 proc (){ L_0: W|IEH}R A = Adeline while (v1 || v2) { true true L_3 L_6 L_1: L_2: L_4: if (v3) { } else { C = Cytosine G = Guanine } L_1 L_7 L_5: true } true L_7: return ; L_2 L_4 } T = Thyamine true L_5 ...
  • 19. WHY DIDN’T WE USE THOSE APPROACHES?  Not optimally effective.  Too slow.  Best speed was using NCD.
  • 20. A DISSIMILARITY METRIC FOR SETS OF STRINGS (WHAT WE ENDED UP USING)  Find a mapping between strings to minimize the sum of distances. p d=ed(p,q) q BR BW|{B}BR BR BI{B}BR BW|{B}BR BSSR BSSR BSR BSSSR
  • 21. COMBINATORIAL OPTIMISATION: THE ASSIGNMENT PROBLEM  Finding a minimum cost mapping is known as the “assignment problem”  Optimal solutions exist in cubic time.  “Greedy” heuristic solutions faster.  Has the properties of a metric.
  • 23. IMPLEMENTATION  Malwise system is 100,000 lines of code of C++.  The modules for this work < 3000 lines of code.  Unpacks malware using an application level emulator (Ruxcon 2010)  Pre-filtering stage to quickly cull non matching variants (Ruxcon 2011)
  • 24. EVALUATION - EFFECTIVENESS  Calculated similarity between Roron malware variants.  Compared results to Ruxcon 2010 work.  In tables, highlighted cells indicates a positive match.  The more matches the more effective it is.
  • 25. EVALUATION - EFFECTIVENESS ao b d e g k m q a ao b d e g k m q a ao 0.44 0.28 0.27 0.28 0.55 0.44 0.44 0.47 ao 0.70 0.28 0.28 0.27 0.75 0.70 0.70 0.75 b 0.44 0.27 0.27 0.27 0.51 1.00 1.00 0.58 b 0.74 0.31 0.34 0.33 0.82 1.00 1.00 0.87 d 0.28 0.27 0.48 0.56 0.27 0.27 0.27 0.27 d 0.28 0.29 0.50 0.74 0.29 0.29 0.29 0.29 e 0.27 0.27 0.48 0.59 0.27 0.27 0.27 0.27 e 0.31 0.34 0.50 0.64 0.32 0.34 0.34 0.33 g 0.28 0.27 0.56 0.59 0.27 0.27 0.27 0.27 g 0.27 0.33 0.74 0.64 0.29 0.33 0.33 0.30 k 0.55 0.51 0.27 0.27 0.27 0.51 0.51 0.75 k 0.75 0.82 0.29 0.30 0.29 0.82 0.82 0.96 m 0.44 1.00 0.27 0.27 0.27 0.51 1.00 0.58 m 0.74 1.00 0.31 0.34 0.33 0.82 1.00 0.87 q 0.44 1.00 0.27 0.27 0.27 0.51 1.00 0.58 q 0.74 1.00 0.31 0.34 0.33 0.82 1.00 0.87 a 0.47 0.58 0.27 0.27 0.27 0.75 0.58 0.58 a 0.75 0.87 0.30 0.31 0.30 0.96 0.87 0.87 Exact Matching Heuristic Approximate (Ruxcon 2010) Matching (Ruxcon 2010) ao b d e g k m q a ao 0.86 0.49 0.54 0.50 0.87 0.86 0.86 0.86 b 0.87 0.57 0.63 0.62 0.96 1.00 1.00 0.96 d 0.61 0.64 0.85 0.91 0.64 0.64 0.64 0.64 e 0.64 0.69 0.85 0.90 0.68 0.69 0.69 0.68 g 0.62 0.68 0.91 0.91 0.68 0.68 0.68 0.68 k 0.88 0.96 0.58 0.62 0.61 0.96 0.96 0.99 m 0.87 1.00 0.57 0.63 0.62 0.96 1.00 0.96 q 0.87 1.00 0.57 0.63 0.62 0.96 1.00 0.96 a 0.87 0.96 0.58 0.62 0.61 0.99 0.96 0.96 Assignment problem
  • 26. EVALUATION – FALSE POSITIVES  Database of 10,000 malware.  Scanned 1,601 benign binaries.  7 false positives. Less than 1%.  Very small binaries have small signatures and cause weak matching.
  • 27. EVALUATION - EFFICIENCY  Median benign and malware processing time is 0.06s and 0.84s. Malware % Samples Benign Time(s) Time(s) 10 0.02 0.16 20 0.02 0.28 30 0.03 0.30 40 0.03 0.36 50 0.06 0.84 60 0.09 0.94 70 0.13 0.97 80 0.25 1.03 90 0.56 1.31 100 8.06 585.16
  • 28. BUT THAT’S NOT ALL WE USE THE MALWISE ENGINE FOR..
  • 29. SIMSEER – A SOFTWARE SIMILARITY WEB SERVICE  An online service to identify similarity between programs  Based on Malwise.  Renders an evolutionary tree to show program relationships.  Free to use!  http://www.foocodechu.com/?q=simseer-a-software-similarit
  • 30.
  • 31.
  • 32. SIMSEER - DEMO  http://www.youtube.com/watch?v=ymo7DKlKCH4
  • 33. BUGWISE  Automatically detect bugs and vulnerabilities in Linux executable binaries.  Uses static program analysis from Malwise.  Decompilation  Data Flow Analysis   Free to use!  http://www.foocodechu.com/?q=bugwise-a-bug-detection-we
  • 34. BUGWISE – SGID GAMES XONIX BUG IN DEBIAN LINUX memset(score_rec[i].login, 0, 11); strncpy(score_rec[i].login, pw->pw_name, 10); memset(score_rec[i].full, 0, 65); strncpy(score_rec[i].full, fullname, 64); score_rec[i].tstamp = time(NULL); free(fullname); if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) { fprintf(stderr, "xonix: cannot reopen high score filen"); free(fullname); gameover_pending = 0; return; }
  • 35. PUBLICATIONS  Book published by Springer.  http://www.springer.com/computer/security+and+ cryptology/book/978-1-4471-2908-0
  • 36. CONCLUSION  Malwise effectively identifies malware variants.  Runs in real-time in expected case.  Large functional code base and years of development time.  Happy to talk to vendors.  http://www.FooCodeChu.com