Silvio Cesare silvio.cesare@gmail.com
          http://www.foocodechu.com
    Ph.D. Candidate, Deakin University
   Ph.D. Candidate at Deakin University.
   Research
    ◦ Malware detection.
    ◦ Automated vulnerability discovery (check out my
      other talk in the main conference).
   Did a Masters by research in malware
    ◦ “Fast automated unpacking and classification of
      malware”.
    ◦ Presented last year at Ruxcon 2010.
   This current work extends last year’s work.
   Traditional AV works well on known samples.

   Doesn’t detect unknown samples.

   Doesn’t detect “suspiciously similar” samples.

   Uses strings as a signature or “birthmark”.

   Compares birthmarks by equality.
   Birthmarks can be program structure.

   Program structure changes less between malware variants.

   Birthmarks can be compared using “approximate
    similarity”.

   Able to detect unknown samples that are
    suspiciously similar to known malware.

   Vastly reduces the number of required signatures.
[Figure: Program p and Program q are each reduced to a birthmark. The two birthmarks are compared ("Similar?"), giving either "MATCH!" or "Different".]
   Control flow is more invariant among
    polymorphic and metamorphic malware.

   A directed graph representing control flow.

   A control flow graph for every procedure.

   One call graph per program.
[Figure: control flow graph of _main, one basic block per group]

    lea     0x4(%esp),%ecx
    and     $0xfffffff0,%esp
    pushl   -0x4(%ecx)
    push    %ebp
    mov     %esp,%ebp
    push    %ecx
    sub     $0x24,%esp
    call    4011b0 <___main>
    movl    $0x0,-0x8(%ebp)
    jmp     40115f <_main+0x2f>

    movl    $0x4020a0,(%esp)
    call    4011b8 <_puts>
    addl    $0x1,-0x8(%ebp)

    cmpl    $0x9,-0x8(%ebp)
    jle     40114f <_main+0x1f>

    add     $0x24,%esp
    pop     %ecx
    pop     %ebp
    lea     -0x4(%ecx),%esp
    ret

[Figure: call graph with nodes Proc_0, Proc_1, Proc_2, Proc_3 and Proc_4]
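
The listing for _main can be read as a control flow graph: each basic block is a node and each possible transfer of control is an edge. The sketch below writes that graph as an adjacency list and checks it for a loop. The block names (entry, body, cond, exit) are labels invented for this illustration, and the representation is an assumption, not the data structure used by the system in the talk.

```python
# Control flow graph of _main as an adjacency list.
# Block names are hypothetical; addresses are taken from the listing.
cfg = {
    "entry": ["cond"],          # ends with: jmp 40115f
    "body":  ["cond"],          # 40114f: calls _puts, increments the counter, falls through
    "cond":  ["body", "exit"],  # 40115f: cmpl/jle back to 40114f, otherwise falls through
    "exit":  [],                # ends with: ret
}


def has_loop(graph, start):
    """Return True if a cycle is reachable from start (depth-first search)."""
    on_path, done = set(), set()

    def visit(node):
        if node in on_path:      # reached a node already on the current path
            return True
        if node in done:         # fully explored earlier, no cycle through it
            return False
        on_path.add(node)
        found = any(visit(succ) for succ in graph[node])
        on_path.discard(node)
        done.add(node)
        return found

    return visit(start)


print(has_loop(cfg, "entry"))
```

The edge from cond back to body is what makes this procedure a loop. That structure survives instruction-level changes such as register renaming.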
   Known as the “Graph Isomorphism” problem.

   Identifies equivalent “structure”.

   In NP, but not known to be NP-complete, and no
    polynomial time algorithm is known.
   Graph edit distance: the minimum number of basic
    operations needed to transform one graph into another.

   If you know the distance between two
    objects, you know the similarity.

   Computing it exactly is NP-hard and infeasible for large graphs.
[Figure: control flow graph with nodes L_0 to L_7 and conditional edges labelled "true", shown next to its structured form below and the resulting string W|IEH}R]

proc(){
L_0:
  while (v1 || v2) {
L_1:
    if (v3) {
L_2:
    } else {
L_4:
    }
L_5:
  }
L_7:
  return;
}
   Input is a string.

   Extract all substrings of fixed size q.

   Substrings are known as q-grams.

   Let’s take q-grams of all decompiled graphs.

   W|IEH}R  →  W|IE, |IEH, IEH}, EH}R
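
A minimal sketch of q-gram extraction, using the string W|IEH}R from the slide with q = 4. The function name `qgrams` is chosen for this example only.

```python
def qgrams(s, q):
    """Return every contiguous substring of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("W|IEH}R", 4))
# → ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string of length n yields n - q + 1 q-grams, so strings shorter than q produce none.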
   An array <E1,...,En>

   A feature vector describes the number of
    occurrences of each feature.

   Ei is the number of times feature i occurs.

   Let’s use the 500 most common q-grams
    as features.

   We use feature vectors as birthmarks.
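
The steps above can be sketched as follows: choose the most common q-grams across a corpus as the feature list, then count how often each feature occurs in one program's strings. The function names and the tiny corpus are assumptions made for illustration. The talk uses the 500 most common q-grams, while this example keeps only 2 so the result stays readable.

```python
from collections import Counter


def qgrams(s, q):
    """Return every contiguous substring of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_qgrams(strings, q):
    """Count q-gram occurrences over a collection of strings."""
    counts = Counter()
    for s in strings:
        counts.update(qgrams(s, q))
    return counts


def select_features(corpus, q, n):
    """Pick the n most common q-grams in the corpus as the feature list."""
    return [gram for gram, _ in count_qgrams(corpus, q).most_common(n)]


def feature_vector(strings, q, features):
    """Birthmark: the number of occurrences of each feature, in feature order."""
    counts = count_qgrams(strings, q)
    return [counts[f] for f in features]


# Hypothetical corpus of procedure strings.
corpus = ["AAAAA", "AAAAB", "AAABB"]
features = select_features(corpus, q=4, n=2)
print(features)
print(feature_vector(["AAAAB"], 4, features))
```

Because every program is mapped onto the same fixed feature list, all birthmarks have the same dimension and can be compared directly.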
   A vector is an n-dimensional point.
   E.g. 2d vector is <x,y>
   Computing the distance between two vectors is fast.
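
The transcript does not say which vector distance is used, so the sketch below shows two common choices as assumptions: Manhattan (L1) and Euclidean (L2) distance. Both take time linear in the number of dimensions.

```python
import math


def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """L2 distance: straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


p, q = [1, 2], [4, 6]
print(manhattan(p, q))   # → 7
print(euclidean(p, q))   # → 5.0
```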
   Software similarity problem extended to
    similarity search over a database.

   Find nearest neighbours (by distance) of a
    query.

   Or find neighbours within a distance of the
    query.
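
Both query types can be sketched with a brute-force scan over the database. The database contents, the sample names and the choice of Manhattan distance are assumptions for illustration only.

```python
def manhattan(p, q):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_neighbour(db, query, dist=manhattan):
    """Return the (name, vector) entry closest to the query."""
    return min(db, key=lambda entry: dist(entry[1], query))


def range_query(db, query, radius, dist=manhattan):
    """Return every entry whose distance to the query is at most radius."""
    return [entry for entry in db if dist(entry[1], query) <= radius]


# Hypothetical database of malware birthmarks.
db = [
    ("sample_a", [5, 0, 2]),
    ("sample_b", [5, 1, 2]),
    ("sample_c", [0, 9, 9]),
]
query = [5, 0, 3]
print(nearest_neighbour(db, query)[0])
print([name for name, _ in range_query(db, query, radius=2)])
```

A scan like this costs one distance computation per database entry, which is the cost an index over a metric space is meant to avoid.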
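[Figure: similarity search. Malware signatures and queries are shown as points, with a range of radius r and the distance d(p,q) between points p and q marked. One query is labelled benign and another malicious.]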
   The vector distance used here is a “metric”.

   It is non-negative, symmetric, zero only for identical
    points, and satisfies the triangle inequality.

   This means you can do a nearest neighbour
    search without brute forcing the entire
    database!
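
One way to see why a metric helps: the triangle inequality gives d(q,p) >= |d(q,pivot) - d(p,pivot)|, so distances to a fixed pivot, computed once in advance, can rule out candidates without computing d(q,p). The sketch below is a single-pivot filter written for illustration. The talk does not say which index structure the system uses, and the class and variable names are assumptions.

```python
def manhattan(p, q):
    """L1 distance, which satisfies the triangle inequality."""
    return sum(abs(a - b) for a, b in zip(p, q))


class PivotIndex:
    """Range search that prunes candidates using one precomputed pivot distance."""

    def __init__(self, points, pivot, dist=manhattan):
        self.dist = dist
        self.pivot = pivot
        # Distance from the pivot to every stored point, computed once.
        self.entries = [(dist(pivot, p), p) for p in points]

    def range_query(self, query, radius):
        """Return (hits, number of distance computations against stored points)."""
        dq = self.dist(self.pivot, query)
        hits, computed = [], 0
        for dp, p in self.entries:
            # Lower bound on d(query, p); if it exceeds radius, p cannot match.
            if abs(dq - dp) > radius:
                continue
            computed += 1
            if self.dist(query, p) <= radius:
                hits.append(p)
        return hits, computed


points = [(0, 0), (1, 1), (10, 10), (20, 20)]
index = PivotIndex(points, pivot=(0, 0))
hits, computed = index.range_query((1, 0), radius=2)
print(hits, computed)
# → [(0, 0), (1, 1)] 2
```

Only two of the four stored points need a real distance computation here. The filter never discards a true match, because it skips a point only when the lower bound already exceeds the radius.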
   System is 100,000 lines of C++ code.

   The modules for this work are under 3,000 lines of code.

   System translates x86 into an intermediate
    language (IL).

   Performs analysis on architecture independent IL.

   Unpacks malware using an application level
    emulator.
   Database of 10,000 malware samples.

   Scanned 1,601 benign binaries.

   10 false positives. Less than 1%.

   Using an additional refinement algorithm,
    reduced to 7 false positives.

   Very small binaries have small signatures and
    cause weak matching.
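
The false positive rates follow directly from the counts on this slide. A short worked calculation, with a helper name chosen for this example:

```python
def false_positive_rate(false_positives, benign_scanned):
    """Fraction of benign binaries wrongly flagged as malware."""
    return false_positives / benign_scanned


BENIGN_SCANNED = 1601  # benign binaries scanned, from the slide

for label, fp in [("initial", 10), ("after refinement", 7)]:
    print(f"{label}: {false_positive_rate(fp, BENIGN_SCANNED):.2%}")
# → initial: 0.62%
# → after refinement: 0.44%
```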
   Calculated similarity between Roron malware
    variants.

   Compared results to Ruxcon 2010 work.

   In the tables, highlighted cells indicate a positive
    match.

   The more matches, the more effective the method.
Exact Matching (Ruxcon 2010)

      ao     b      d      e      g      k      m      q      a
ao    -      0.44   0.28   0.27   0.28   0.55   0.44   0.44   0.47
b     0.44   -      0.27   0.27   0.27   0.51   1.00   1.00   0.58
d     0.28   0.27   -      0.48   0.56   0.27   0.27   0.27   0.27
e     0.27   0.27   0.48   -      0.59   0.27   0.27   0.27   0.27
g     0.28   0.27   0.56   0.59   -      0.27   0.27   0.27   0.27
k     0.55   0.51   0.27   0.27   0.27   -      0.51   0.51   0.75
m     0.44   1.00   0.27   0.27   0.27   0.51   -      1.00   0.58
q     0.44   1.00   0.27   0.27   0.27   0.51   1.00   -      0.58
a     0.47   0.58   0.27   0.27   0.27   0.75   0.58   0.58   -

Heuristic Approximate Matching (Ruxcon 2010)

      ao     b      d      e      g      k      m      q      a
ao    -      0.70   0.28   0.28   0.27   0.75   0.70   0.70   0.75
b     0.74   -      0.31   0.34   0.33   0.82   1.00   1.00   0.87
d     0.28   0.29   -      0.50   0.74   0.29   0.29   0.29   0.29
e     0.31   0.34   0.50   -      0.64   0.32   0.34   0.34   0.33
g     0.27   0.33   0.74   0.64   -      0.29   0.33   0.33   0.30
k     0.75   0.82   0.29   0.30   0.29   -      0.82   0.82   0.96
m     0.74   1.00   0.31   0.34   0.33   0.82   -      1.00   0.87
q     0.74   1.00   0.31   0.34   0.33   0.82   1.00   -      0.87
a     0.75   0.87   0.30   0.31   0.30   0.96   0.87   0.87   -

Q-Grams

      ao     b      d      e      g      k      m      q      a
ao    -      0.86   0.53   0.64   0.59   0.86   0.86   0.86   0.86
b     0.88   -      0.66   0.76   0.71   0.97   1.00   1.00   0.97
d     0.65   0.72   -      0.88   0.93   0.73   0.72   0.72   0.73
e     0.72   0.80   0.87   -      0.93   0.80   0.80   0.80   0.80
g     0.69   0.77   0.93   0.93   -      0.77   0.77   0.77   0.77
k     0.88   0.97   0.67   0.77   0.72   -      0.97   0.97   0.99
m     0.88   1.00   0.66   0.76   0.71   0.97   -      1.00   0.97
q     0.88   1.00   0.66   0.76   0.71   0.97   1.00   -      0.97
a     0.87   0.97   0.67   0.77   0.72   0.99   0.97   0.97   -
   Faster than Ruxcon 2010.
   Median benign processing time is 0.06s.
   Median malware processing time is 0.84s.
   The slowest result may be due to memory thrashing.
% Samples   Benign Time (s)   Malware Time (s)
10          0.02              0.16
20          0.02              0.28
30          0.03              0.30
40          0.03              0.36
50          0.06              0.84
60          0.09              0.94
70          0.13              0.97
80          0.25              1.03
90          0.56              1.31
100         8.06              585.16
   Improved effectiveness and efficiency compared to
    Ruxcon 2010.


   Runs in real time in the expected case.


   Large functional code base and years of development
    time.


   Happy to talk to vendors.
   Full academic paper at IEEE Trustcom.


   Research page http://www.foocodechu.com


   Book on “Software similarity and classification”
    available in 2012.


   Wiki on software similarity and classification
    http://www.foocodechu.com/wiki

Faster, More Effective Flowgraph-based Malware Classification

  • 1. Silvio Cesare silvio.cesare@gmail.com http://www.foocodechu.com Ph.D. Candidate, Deakin University
  • 2. Ph.D. Candidate at Deakin University. Research: malware detection; automated vulnerability discovery (check out my other talk in the main conference). Did a Masters by research in malware: “Fast automated unpacking and classification of malware”, presented last year at Ruxcon 2010. This current work extends last year’s work.
  • 3. Traditional AV works well on known samples. Doesn’t detect unknown samples. Doesn’t detect “suspiciously similar” samples. Uses strings as a signature or “birthmark”. Compares birthmarks by equality.
  • 4. Birthmarks can be program structure, which is more static among malware variants. Birthmarks can be compared using “approximate similarity”. Able to detect unknown samples that are suspiciously similar to known malware. Vastly reduces the number of required signatures.
  • 5. Program p → birthmark; program q → birthmark. Are the birthmarks similar? If so, MATCH; otherwise, different.
  • 6. Control flow is more invariant among polymorphic and metamorphic malware. A directed graph representing control flow. A control flow graph for every procedure. One call graph per program.
  • 7. [Figure: control flow graph of _main, shown as basic blocks, next to a call graph with nodes Proc_0 to Proc_4.]
      Block 1: lea 0x4(%esp),%ecx; and $0xfffffff0,%esp; pushl -0x4(%ecx); push %ebp; mov %esp,%ebp; push %ecx; sub $0x24,%esp; call 4011b0 <___main>; movl $0x0,-0x8(%ebp); jmp 40115f <_main+0x2f>
      Block 2: movl $0x4020a0,(%esp); call 4011b8 <_puts>; addl $0x1,-0x8(%ebp)
      Block 3: cmpl $0x9,-0x8(%ebp); jle 40114f <_main+0x1f>
      Block 4: add $0x24,%esp; pop %ecx; pop %ebp; lea -0x4(%ecx),%esp; ret
  • 8. Known as the “Graph Isomorphism” problem. Identifies equivalent “structure”. In NP but not proven to be NP-complete, and no polynomial-time algorithm is known.
  • 9. Graph edit distance: the number of basic operations needed to transform one graph into another. If you know the distance between two objects, you know their similarity. Computing it is NP-hard and infeasible in practice.
  • 10. Example: proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; } is represented by the string W|IEH}R. [Figure: control flow graph of proc with nodes L_0 to L_7 and edges labelled “true”.]
  • 11. Input is a string. Extract all substrings of fixed size Q. Substrings are known as q-grams. Let’s take q-grams of all decompiled graphs. Example with Q = 4: W|IEH}R gives W|IE, |IEH, IEH}, EH}R.
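
The q-gram extraction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual C++ code; the function name `qgrams` is an assumption made for the example.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return every contiguous substring of length q, in left-to-right order."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A string of length n has n - q + 1 q-grams (none if it is shorter than q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


if __name__ == "__main__":
    # The decompiled-graph string from the slide, with Q = 4.
    print(qgrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than Q yields no q-grams. That fits the later observation that very small binaries have small signatures.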
  • 12. An array <E1,...,En>. A feature vector describes the number of occurrences of each feature. Ei is the number of times feature i occurs. Let’s use the 500 most common q-grams as features. We use feature vectors as birthmarks.
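
A feature vector built from q-gram counts might look like the sketch below. It assumes the features are chosen by counting q-grams over a corpus of strings. The helper names (`select_features`, `feature_vector`) and the tie-breaking rule are assumptions for illustration; the slides only say that the 500 most common q-grams are used.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All contiguous substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def select_features(corpus: list[str], q: int, n_features: int) -> list[str]:
    """Pick the n_features most common q-grams across the corpus.

    Ties are broken alphabetically so the result is deterministic
    (an assumption; the slides do not specify tie-breaking).
    """
    counts = Counter()
    for s in corpus:
        counts.update(qgrams(s, q))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:n_features]]


def feature_vector(s: str, q: int, features: list[str]) -> list[int]:
    """Entry i is the number of times feature i occurs in s."""
    counts = Counter(qgrams(s, q))
    return [counts[f] for f in features]


if __name__ == "__main__":
    features = select_features(["abab", "abcd"], q=2, n_features=2)
    print(features)                             # ['ab', 'ba']
    print(feature_vector("abab", 2, features))  # [2, 1]
```

Because the feature list is fixed, every program maps to a vector of the same length, so any two birthmarks can be compared directly.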
  • 13. A vector is an n-dimensional point. E.g. a 2D vector is <x,y>. Fast to compare.
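
Treating feature vectors as points means two birthmarks can be compared with an ordinary vector distance. The transcript does not say which distance the system uses, so the sketch below shows two common choices, Manhattan and Euclidean, purely as assumptions.

```python
import math


def manhattan(p: list[float], q: list[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p: list[float], q: list[float]) -> float:
    """Straight-line distance between two points (L2 distance)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    print(manhattan([0, 0], [3, 4]))  # 7
    print(euclidean([0, 0], [3, 4]))  # 5.0
```

Both run in time linear in the number of features, so comparing two 500-entry vectors is cheap compared with the graph-based comparisons discussed earlier.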
  • 14. Software similarity problem extended to similarity search over a database. Find nearest neighbours (by distance) of a query. Or find neighbours within a distance of the query.
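  • 15. [Figure: similarity search around a query point q with radius r; p is a database point at distance d(p,q). Labels: Query Benign, Query Malicious, Query Malware.]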
  • 16. Vector distances here are “metric”: they have the mathematical properties of a metric, notably the triangle inequality. This means you can do a nearest neighbour search without brute forcing the entire database!
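
One way to see why the metric property avoids brute force: by the triangle inequality, |d(q,v) - d(x,v)| is a lower bound on d(q,x) for any reference point v. If that bound already exceeds the search radius, x can be skipped without computing d(q,x). The sketch below uses a single hypothetical pivot and Manhattan distance. The transcript does not say which index structure the system uses, so `PivotIndex` is an illustrative assumption, not the author's method.

```python
def manhattan(p, q):
    """L1 distance; it satisfies the triangle inequality."""
    return sum(abs(a - b) for a, b in zip(p, q))


class PivotIndex:
    """Range search that prunes candidates using one precomputed pivot distance."""

    def __init__(self, points, pivot, dist=manhattan):
        self.dist = dist
        self.pivot = pivot
        # Distance from every database point to the pivot, computed once.
        self.entries = [(dist(x, pivot), x) for x in points]

    def range_search(self, query, radius):
        """Return (hits, evaluated): points within radius, and how many
        full distance computations were actually needed."""
        dq = self.dist(query, self.pivot)
        hits, evaluated = [], 0
        for dx, x in self.entries:
            # Triangle inequality: d(query, x) >= |dq - dx|, so skip if too far.
            if abs(dq - dx) > radius:
                continue
            evaluated += 1
            if self.dist(query, x) <= radius:
                hits.append(x)
        return hits, evaluated


if __name__ == "__main__":
    index = PivotIndex([(0, 0), (1, 1), (10, 10), (20, 20)], pivot=(0, 0))
    print(index.range_search((1, 0), radius=2))  # ([(0, 0), (1, 1)], 2)
```

Here only two of the four stored points need a full distance computation. Tree-based metric indexes apply the same pruning recursively.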
  • 17. The system is 100,000 lines of C++. The modules for this work are under 3,000 lines of code. The system translates x86 into an intermediate language (IL). Performs analysis on the architecture-independent IL. Unpacks malware using an application-level emulator.
  • 18. Database of 10,000 malware samples. Scanned 1,601 benign binaries. 10 false positives, about 0.6%. Using an additional refinement algorithm, reduced to 7 false positives (about 0.4%). Very small binaries have small signatures and cause weak matching.
  • 19. Calculated similarity between Roron malware variants. Compared results to Ruxcon 2010 work. In tables, highlighted cells indicate a positive match. The more matches, the more effective it is.
  • 20. Pairwise similarity between variants ao, b, d, e, g, k, m, q, a (diagonal omitted).

      Exact Matching (Ruxcon 2010)
            ao    b     d     e     g     k     m     q     a
      ao    -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
      b     0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
      d     0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
      e     0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
      g     0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
      k     0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
      m     0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
      q     0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
      a     0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -

      Heuristic Approximate Matching (Ruxcon 2010)
            ao    b     d     e     g     k     m     q     a
      ao    -     0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
      b     0.74  -     0.31  0.34  0.33  0.82  1.00  1.00  0.87
      d     0.28  0.29  -     0.50  0.74  0.29  0.29  0.29  0.29
      e     0.31  0.34  0.50  -     0.64  0.32  0.34  0.34  0.33
      g     0.27  0.33  0.74  0.64  -     0.29  0.33  0.33  0.30
      k     0.75  0.82  0.29  0.30  0.29  -     0.82  0.82  0.96
      m     0.74  1.00  0.31  0.34  0.33  0.82  -     1.00  0.87
      q     0.74  1.00  0.31  0.34  0.33  0.82  1.00  -     0.87
      a     0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87  -

      Q-Grams
            ao    b     d     e     g     k     m     q     a
      ao    -     0.86  0.53  0.64  0.59  0.86  0.86  0.86  0.86
      b     0.88  -     0.66  0.76  0.71  0.97  1.00  1.00  0.97
      d     0.65  0.72  -     0.88  0.93  0.73  0.72  0.72  0.73
      e     0.72  0.80  0.87  -     0.93  0.80  0.80  0.80  0.80
      g     0.69  0.77  0.93  0.93  -     0.77  0.77  0.77  0.77
      k     0.88  0.97  0.67  0.77  0.72  -     0.97  0.97  0.99
      m     0.88  1.00  0.66  0.76  0.71  0.97  -     1.00  0.97
      q     0.88  1.00  0.66  0.76  0.71  0.97  1.00  -     0.97
      a     0.87  0.97  0.67  0.77  0.72  0.99  0.97  0.97  -
  • 21. Faster than Ruxcon 2010. Median benign processing time is 0.06s. Median malware processing time is 0.84s. Slowest result may be memory thrashing.

      % Samples   Benign Time (s)   Malware Time (s)
      10          0.02              0.16
      20          0.02              0.28
      30          0.03              0.30
      40          0.03              0.36
      50          0.06              0.84
      60          0.09              0.94
      70          0.13              0.97
      80          0.25              1.03
      90          0.56              1.31
      100         8.06              585.16
  • 22. Improved effectiveness and efficiency compared to Ruxcon 2010. Runs in real time in the expected case. Large functional code base and years of development time. Happy to talk to vendors.
  • 23. Full academic paper at IEEE Trustcom. Research page: http://www.foocodechu.com. Book on “Software similarity and classification” available in 2012. Wiki on software similarity and classification: http://www.foocodechu.com/wiki