A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS

Published in: Technology

1. A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS. Silvio Cesare, Deakin University
2. Introduction: Started off in industry (Qualys, now Volvent). Have a Masters by Research. About to receive a PhD from Deakin University. Spent the last 5 years in postgraduate university research. Learnt some cool things along the way.
3. What did I do at University? Malwise v1 (Masters): malware variant detection system. Malwise v2: improved version. Simseer: binary comparison and visualization service. Simseer Cluster: binary clustering service. Simseer Search: further improved malware variant search service. Clonewise: automated detection of embedded libraries in source. Bugalyze: detection of bugs using data flow analysis.
4. Outline: Mathematical Objects; Comparing; Similarity Searching; Classification; Clustering; Program Analysis.
5. An incomplete list of mathematical objects: Strings, Vectors, Sets, Sets of Objects, Trees, Graphs.
6. Objects: Different objects have different comparison costs. Example: Comparing two vectors is fairly fast. Exact matching of two strings is fairly fast. Inexact matching of two strings is moderately slow. Comparing two graphs is slow. (Figure: a sequence alignment of two strings, which costs O(mn).)
7. Transforming one object to another. Problem: Comparing two 100 KB strings using the edit distance, e.g. ed("hello", "ggello") = 2, is impractically slow. Solution: Transform the strings into vectors, then use a vector comparison, which is fast. Examples: comparing malware samples, finding near-duplicate web pages, comparing e-mails.
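The edit distance on this slide can be sketched as the textbook dynamic-programming (Levenshtein) algorithm. The function name `edit_distance` is illustrative and not taken from the talk. Its running time is proportional to the product of the two string lengths, which is why it becomes impractical for 100 KB inputs.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b.
    Runs in O(len(a) * len(b)) time and O(len(b)) space."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2: insert 'g', substitute 'h' -> 'g'
```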
8. N-Grams: Extract all N-length substrings (N-Grams) from the original string. From a training set of strings, choose the best N-Grams. Each unique N-Gram is an index in a vector. The value of the element is the number of times that N-Gram occurs. Example: the 4-grams of W|IEH}R are W|IE, |IEH, IEH} and EH}R.
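A minimal sketch of the N-Gram steps, assuming that "best" means "most frequent in the training set". The slide does not say how the N-Grams are chosen, so that selection rule and the helper names are assumptions.

```python
from collections import Counter


def ngrams(s, n):
    """All substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def top_ngrams(training, n, k):
    """Assumed selection rule: the k most frequent N-Grams in the training set."""
    counts = Counter(g for s in training for g in ngrams(s, n))
    return [g for g, _ in counts.most_common(k)]


def ngram_vector(s, n, vocabulary):
    """One vector element per chosen N-Gram; the value is its occurrence count."""
    counts = Counter(ngrams(s, n))
    return [counts[g] for g in vocabulary]


print(ngrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

Once every string is reduced to a fixed-length vector like this, comparison cost no longer depends on the string lengths, only on the vocabulary size.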
9. Another N-Gram example: Extract N-Grams. Represent the new object as a 'Set of N-Grams'. Compare sets using set similarity metrics.
10. A Graph problem: Graph problems like approximate similarity are slow to solve. Decompose the graph into subgraphs of at most k nodes. Canonicalize the small graphs, represent each by its adjacency matrix, and transform that to a string. The graph is now a 'Set of Strings'. Optionally represent it as a vector of 'important k-subgraphs'. Use vector distance metrics to compare, index, and search.
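The decomposition above might look roughly like the sketch below. Two choices here are assumptions rather than the method used in Malwise: subgraphs are grown breadth-first from every node, and canonicalization is a brute-force search for the smallest adjacency-matrix string over all node orderings, which is only feasible for small k.

```python
from collections import deque
from itertools import permutations


def bfs_nodes(graph, start, k):
    """Collect up to k nodes reachable from start, breadth-first.
    graph maps every node to a list of its successors."""
    seen, order, queue = {start}, [start], deque([start])
    while queue and len(order) < k:
        for succ in graph[queue.popleft()]:
            if succ not in seen and len(order) < k:
                seen.add(succ)
                order.append(succ)
                queue.append(succ)
    return order


def canonical_string(graph, nodes):
    """Smallest row-major adjacency-matrix bit string over all orderings
    of nodes (k! orderings, so keep k small)."""
    best = None
    for perm in permutations(nodes):
        bits = "".join("1" if v in graph[u] else "0" for u in perm for v in perm)
        if best is None or bits < best:
            best = bits
    return best


def k_subgraph_strings(graph, k):
    """Represent the graph as a set of canonical k-subgraph strings."""
    return {canonical_string(graph, bfs_nodes(graph, node, k)) for node in graph}


cfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(k_subgraph_strings(cfg, 3)))
```

Because the strings are canonical, two graphs that differ only in node names yield the same set when k covers each whole subgraph. When the breadth-first walk is cut off at k nodes, the result can depend on successor order, so a production version would need a deterministic traversal.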
11. K-subgraph decomposition (Figure: a control flow graph with nodes L_0 to L_7 is decomposed into k-node subgraphs; each subgraph is written out as the rows of its adjacency matrix, e.g. 0101000 0000000 0000010 0010100 0000010 0000001 1001000.)
12. Graphs – Case Study: Implemented in Malwise and Simseer. Take control flow graphs of programs. Decompile into strings. One: consider the program as a vector of N-Grams of decompiled strings. Two: consider the program as a set of strings. (Figure: a control flow graph with nodes L_0 to L_7 and its decompiled form: proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; })
13. Final Remarks on Objects: Know how to represent your problem. Look into how the representation can be approximated by transforming it into another object. Vectors are often a good choice.
14. Comparing. Problem: Measure the similarity (or distance) between two objects. Solution: Represent objects mathematically. Use the multitude of mathematical measures available. Examples: malware similarity, near-duplicate web pages.
15. Comparing Sets: A set is a collection of elements. Given an equality function between elements, we can measure set similarity. Inexact matching. Dice coefficient: $s = \frac{2|A \cap B|}{|A| + |B|}$. Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
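Both set measures are a few lines each with Python's built-in sets. Returning 1.0 for two empty sets is a convention chosen here to avoid dividing by zero, not something the slides specify.

```python
def dice(a, b):
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard index: |A n B| / |A u B|."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


a, b = {1, 2, 3}, {2, 3, 4}
print(dice(a, b), jaccard(a, b))  # two shared elements out of four distinct ones
```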
16. Comparing Vectors – Ugh, math. Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$. Manhattan Distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$. Cosine Similarity: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$.
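The three vector measures translate directly into code. This sketch assumes both vectors have the same length and, for the cosine similarity, that neither is the zero vector.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7
```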
17. Vector distance – a different look: A vector is an n-dimensional point in space. E.g., a 2-d vector is <x,y>.
18. Cosine similarity: Draw a line from the origin to each n-dimensional point. Given 2 lines, what's the angle (theta) between them? The smaller the angle, the more similar. (Figure: Point A, Point B, and the angle Theta between them.)
19. Comparing Vectors – Case Study. Malwise v2: feature vector of N-Grams of decompiled flowgraphs; Manhattan Distance. Simseer Search: same feature vector; Euclidean Distance.
20. Comparing Sets – Case Study: Malwise v1. An element is a graph invariant of the control flow graph, represented as an integer. A program is a set of integers. Compare similarity between two programs using the Dice coefficient.
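21. Malwise v1 - Comparing Sets (Figure: a control flow graph with nodes 1 to 4 and true/false edges, encoded per node as (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ().) Weighted similarity: $s(A, B) = \frac{2 \sum_i w_i \times |A_i \cap B_i|}{\sum_i w_i \times |A_i| + \sum_i w_i \times |B_i|}$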
22. Comparing Sets of Strings in Malwise v2 – Case Study: A string is a decompiled flowgraph. A program is a set of strings. Use the edit distance between strings. Construct a 1:1 mapping between the elements of the two sets such that the sum of distances is minimized. Solved using 'combinatorial optimisation': this is the Assignment Problem, with a solution by "graph matching".
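To make the assignment problem concrete, the sketch below tries every 1:1 mapping and keeps the cheapest one. This brute force is a stand-in for the graph-matching solution named on the slide and only works for a handful of strings. Padding the smaller set with empty strings is an assumption made here so that sets of different sizes can still be mapped 1:1.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_cost_mapping(p, q, distance=edit_distance):
    """Brute-force assignment: try every 1:1 mapping, keep the cheapest.
    Assumption: the smaller set is padded with empty strings."""
    p, q = list(p), list(q)
    size = max(len(p), len(q))
    p += [""] * (size - len(p))
    q += [""] * (size - len(q))
    best_cost, best_pairs = None, None
    for perm in permutations(range(size)):
        cost = sum(distance(p[i], q[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_pairs = [(p[i], q[j]) for i, j in enumerate(perm)]
    return best_cost, best_pairs


cost, pairs = min_cost_mapping(["BR", "BSSR"], ["BSR"])
print(cost, pairs)  # cost 3: BSSR maps to BSR, BR maps to the padding string
```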
23. Malwise v2 - Comparing Sets of Strings (Figure: the control flow graph L_0 to L_7 and its decompiled form proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; } are encoded as the string W|IEH}R. Two programs p and q, each a set of strings such as BR, BW|{B}BR, BI{B}BR, BSSR, BSR and BSSSR, are compared with d = ed(p, q).)
24. Final Remarks on Comparing: Inexact matching is your friend. Try to use known distance metrics. They have useful properties and index better. If it's too slow to compare, transform the object.
25. Similarity Searching. Problem: Find all objects in a database that are 'similar' to my query. Example: Find all words in a dictionary with at most 3 differences to my query word. This problem is known as a 'similarity search'. Solution: Naive exhaustive search works, but it is better to use 'Metric Trees'.
26. Similarity Search Constraints. Variations: K-nearest neighbours, the k closest objects to the query; or all objects within a specific distance of the query. Search is based on a 'metric distance'. Metric distances satisfy mathematical properties such as symmetry and the triangle inequality. Examples: Euclidean Distance, Jaccard Distance. Cosine Distance is not a metric.
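The claim that cosine distance is not a metric can be checked numerically. The sketch below takes cosine distance to be 1 minus the cosine similarity (an assumed but common definition) and searches a few points for a triple that breaks the triangle inequality. The helper name `triangle_violation` is illustrative.

```python
import math
from itertools import permutations


def cosine_distance(a, b):
    """1 - cos(theta); assumes neither vector is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def manhattan(p, q):
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def triangle_violation(points, distance, eps=1e-9):
    """Return a triple (x, y, z) with d(x, z) > d(x, y) + d(y, z), or None."""
    for x, y, z in permutations(points, 3):
        if distance(x, z) > distance(x, y) + distance(y, z) + eps:
            return x, y, z
    return None


points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(triangle_violation(points, cosine_distance))  # a violating triple
print(triangle_violation(points, manhattan))        # None
```

Going from (1, 0) to (0, 1) directly costs 1.0 under cosine distance, but the detour through (1, 1) costs only about 0.59 in total. Metric trees rely on the triangle inequality for pruning, so a distance like this can cause them to miss results.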
27. Searching – Case Study: Malwise v2. Distance metric is Manhattan Distance. Use VP-Trees to index and search in stage 1. Use DBM-Trees to index and search in stage 2. Implemented using the open source GBDI Arboretum library. (Figure: a query point q with search radius r and the distance d(p,q) to a database object p; labels: Query, Benign, Malicious, Malware.)
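A small vantage-point tree shows how a metric distance lets a range query skip parts of the database. This is an illustrative sketch, not the GBDI Arboretum implementation: the class name, the random choice of vantage point and the median split are all assumptions.

```python
import random


def manhattan(p, q):
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


class VPTree:
    """Each node is (vantage, radius, inside, outside): points within
    radius of the vantage point go inside, the rest go outside."""

    def __init__(self, points, distance):
        self.distance = distance
        self.root = self._build(list(points))

    def _build(self, points):
        if not points:
            return None
        vantage = points.pop(random.randrange(len(points)))
        if not points:
            return (vantage, 0.0, None, None)
        dists = [self.distance(vantage, p) for p in points]
        radius = sorted(dists)[len(dists) // 2]  # median distance
        inside = [p for p, d in zip(points, dists) if d <= radius]
        outside = [p for p, d in zip(points, dists) if d > radius]
        return (vantage, radius, self._build(inside), self._build(outside))

    def range_search(self, query, r):
        """All indexed points p with distance(query, p) <= r."""
        found, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            vantage, radius, inside, outside = node
            d = self.distance(query, vantage)
            if d <= r:
                found.append(vantage)
            # The triangle inequality bounds where matches can be.
            if d - r <= radius:   # query ball overlaps the inside region
                stack.append(inside)
            if d + r > radius:    # query ball reaches the outside region
                stack.append(outside)
        return found


words = [(1, 1), (2, 5), (9, 9), (3, 2), (8, 7)]
tree = VPTree(words, manhattan)
print(sorted(tree.range_search((2, 2), 2)))  # [(1, 1), (3, 2)]
```

The two pruning tests are only valid for a true metric, which is why the choice of distance matters for indexing.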
28. Final Remarks on Searching: Searching for inexact matches is useful. Use good distance metrics. Use open source libraries.
29. Classification. The problem: Given a set of N classes and a query object, assign one of the classes to the object. Examples: Is this binary (malicious, not malicious)? Is this Gmail email (primary, social, promotional)? Is this web page (defaced, not defaced)? (Figure: objects grouped into Class A and Class B.)
30. Classification Methodology. Supervised Learning: Given a training set of objects labelled by their class, build a model. Then use the model to classify unknown objects. Unsupervised Learning: No labelled data exists. "Cluster" objects into classes. Use the clusters to train a model. Then classify as normal.
31. Classification – What do I have to do? Represent objects using "feature vectors". A vector is an array. Each element represents a "feature". The value of the element tends to be a count of something, or a size. Feature examples: the number of times a dictionary word such as "Hello" appears in an e-mail; the size of a binary; the number of times LoadLibraryA is executed.
32. Classification – WEKA? Put the feature vectors into the text-based ARFF file format. Plug into the WEKA machine learning toolkit. Experiment with different classifiers. Part of your labelled data can be held back to evaluate the accuracy.
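The idea of holding back part of the labelled data can be sketched without WEKA. The example below is a stand-in, not WEKA's API: it uses a 1-nearest-neighbour classifier with Euclidean distance and a simple tail split, and it assumes the caller has already shuffled the data.

```python
import math


def nearest_neighbour(train, query):
    """Label of the training vector closest to query (Euclidean distance).
    train is a list of (feature_vector, label) pairs."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def holdout_accuracy(labelled, test_fraction=0.3):
    """Train on the first part of the data, evaluate on the held-back tail.
    Assumes the data is already shuffled and has at least two items."""
    n_test = max(1, round(len(labelled) * test_fraction))
    train, test = labelled[:-n_test], labelled[-n_test:]
    correct = sum(nearest_neighbour(train, vec) == label for vec, label in test)
    return correct / len(test)


data = [((0, 0), "benign"), ((10, 10), "malicious"),
        ((0, 1), "benign"), ((10, 11), "malicious"),
        ((1, 0), "benign"), ((11, 10), "malicious"),
        ((1, 1), "benign"), ((11, 11), "malicious"),
        ((0, 2), "benign"), ((10, 12), "malicious")]
print(holdout_accuracy(data))  # 1.0 on this well-separated toy data
```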
33. Weka ARFF file
    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,?
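Writing feature vectors out in this layout takes only a few lines. The helper below is a hypothetical sketch that handles numeric attributes plus one nominal class attribute, as in the iris example. A label of `None` is written as `?`, the ARFF marker for a missing value, which is how the unlabelled last row above is expressed.

```python
def to_arff(relation, numeric_attributes, classes, rows):
    """Render (feature_vector, label) rows as ARFF text.
    Covers only numeric attributes and a single nominal class attribute."""
    lines = ["@RELATION " + relation]
    lines += ["@ATTRIBUTE " + name + " NUMERIC" for name in numeric_attributes]
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines.append("@DATA")
    for features, label in rows:
        values = [str(v) for v in features]
        values.append(label if label is not None else "?")  # "?" = unknown
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


print(to_arff(
    "iris",
    ["sepallength", "sepalwidth", "petallength", "petalwidth"],
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    [((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
     ((5.0, 3.6, 1.4, 0.2), None)],
))
```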
34. WEKA (Figure: the WEKA toolkit, University of Waikato.) 10/25/2013
35. Classification – Case Study: Clonewise. The feature vector is a set of features extracted from a pair of packages. Classify: do these packages share code (yes, no)? Classify: is the 1st package embedded in the 2nd package (yes, no)?
36. Final Remarks on Classification: Lots of problems can be framed as classification. Learn how to use WEKA. Vectors are very good representations.
37. Clustering. Problem: Group together "similar" objects under some notion of similarity. Easy solution: Represent objects using "feature vectors". Plug into WEKA. (Figure: packages in Fedora Linux.)
38. Clustering - Case Study: Simseer Cluster. Represent binaries using N-Grams of decompiled flowgraphs. Use the most frequent N-Grams as features. The distance measure is cosine distance.
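As a rough illustration of clustering feature vectors by cosine distance, the sketch below links any two vectors whose distance is under a threshold and reports the connected groups (single-linkage clustering with a cut-off). The slides do not say which algorithm Simseer Cluster uses, so both the algorithm and the threshold value are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cos(theta); assumes neither vector is the zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def threshold_clusters(vectors, threshold):
    """Group vector indices: two vectors share a cluster if a chain of
    pairwise distances <= threshold connects them (union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


samples = [(1, 0), (2, 0.1), (0, 1), (0.1, 3)]
print(threshold_clusters(samples, 0.1))  # [[0, 1], [2, 3]]
```

Cosine distance ignores vector length, so (1, 0) and (2, 0.1) land together even though their N-Gram counts differ in magnitude.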
39. Final Remarks on Clustering: A classic machine learning problem. Again, learn to use WEKA.
40. Program Analysis: An incredibly large and deep field. This section skims the surface. Main approaches: Model Checking, Theorem Proving, Abstract Interpretation, Data Flow Analysis.
41. Model Checking: Looks at the program states generated by a program. Some states indicate bugs. Try BLAST, a model checker for small C programs. Caveat: it's pretty old now.
42. Theorem Proving - SMT. SMT, what is it? An equation solver that covers the types of operations seen in machine code. Approach for bug detection: User input can generally be anything, so treat it as a "symbolic" variable. The rest is concrete. Simulate execution of the program, plugging all the machine code that is executed into the solver formulas. Concolic execution: combining symbolic execution with concrete execution.
43. Concolic Execution: At branches, can we have user input that forces us to go down each path? Use the SMT solver to tell us. Launch execution down 'feasible' paths. Use the solver to tell us if bugs are present, e.g. what user input, if any, can make this pointer NULL?
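The loop described above can be mimicked in a toy form. In this sketch a brute-force search over one 8-bit input stands in for the SMT solver, and the program under test reports its branch conditions through a hypothetical `branch` callback. Both are simplifications for illustration. Real tools work on machine code and hand the collected path conditions to an actual solver.

```python
def run(program, x):
    """Concrete execution that also records the path condition:
    a list of (predicate, outcome) pairs, one per branch taken."""
    path = []

    def branch(predicate, value):
        outcome = predicate(value)
        path.append((predicate, outcome))
        return outcome

    return program(x, branch), path


def solve(constraints, domain=range(256)):
    """Stand-in for an SMT solver: brute force over an 8-bit input."""
    for v in domain:
        if all(pred(v) == wanted for pred, wanted in constraints):
            return v
    return None  # path is infeasible within the domain


def explore(program, seed=0):
    """Flip one branch at a time and solve for an input reaching it."""
    worklist, seen, results = [seed], set(), {}
    while worklist:
        x = worklist.pop()
        result, path = run(program, x)
        outcomes = tuple(outcome for _, outcome in path)
        if outcomes in seen:
            continue
        seen.add(outcomes)
        results[x] = result
        for i, (pred, outcome) in enumerate(path):
            new_input = solve(path[:i] + [(pred, not outcome)])
            if new_input is not None:
                worklist.append(new_input)
    return results


def target(x, branch):
    """Toy program: the 'bug' needs x > 100 and x % 7 == 5."""
    if branch(lambda v: v > 100, x):
        if branch(lambda v: v % 7 == 5, x):
            return "bug"
    return "ok"


print(explore(target))  # includes an input that reaches "bug"
```

Starting from the seed input 0, the explorer negates the first branch to obtain 101, then negates the second branch under the first to obtain 103, which triggers the bug.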
44. Concolic path-sensitive analysis (Figure: control flow graph of the following code, with basic blocks numbered along the analysed path.)
    lea    0x4(%esp),%ecx
    and    $0xfffffff0,%esp
    pushl  -0x4(%ecx)
    push   %ebp
    mov    %esp,%ebp
    push   %ecx
    sub    $0x24,%esp
    call   4011b0 <___main>
    movl   $0x0,-0x8(%ebp)
    jmp    40115f <_main+0x2f>

    movl   $0x4020a0,(%esp)
    call   4011b8 <_puts>
    addl   $0x1,-0x8(%ebp)

    cmpl   $0x9,-0x8(%ebp)
    jle    40114f <_main+0x1f>

    add    $0x24,%esp
    pop    %ecx
    pop    %ebp
    lea    -0x4(%ecx),%esp
    ret
45. Abstract Interpretation: Abstract the execution of the program. Example: only consider the sign of a variable, not the actual value. Requires a transfer function: what an instruction does to the abstract data. And a Join/Meet function: how data is combined when it meets from different control flow paths.
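The sign example can be written down as a five-element abstract domain with one transfer function per operation and a join. The names below are illustrative, and only addition and multiplication are modelled.

```python
# Abstract values: BOTTOM (no information yet), NEG, ZERO, POS, TOP (any sign).
BOTTOM, NEG, ZERO, POS, TOP = "bottom", "-", "0", "+", "top"


def abstract(n):
    """Map a concrete integer to its sign."""
    return NEG if n < 0 else POS if n > 0 else ZERO


def join(a, b):
    """Combine facts arriving from two control flow paths."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    return a if a == b else TOP


def transfer_mul(a, b):
    """What 'x * y' does to the abstract data."""
    if BOTTOM in (a, b):
        return BOTTOM
    if ZERO in (a, b):
        return ZERO  # zero times anything is zero
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def transfer_add(a, b):
    """What 'x + y' does to the abstract data."""
    if BOTTOM in (a, b):
        return BOTTOM
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    return a if a == b else TOP  # e.g. positive + negative: sign unknown


# if (c) y = 5; else y = -1;  the two paths join to an unknown sign
print(join(abstract(5), abstract(-1)))     # top
# z = (-3) * (-4) is known to be positive without computing 12
print(transfer_mul(abstract(-3), abstract(-4)))  # +
```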
46. Data Flow Analysis: Similar to abstract interpretation. Uses a transfer function and a join. Implement both using a monotone framework. Data flow analysis is used by compilers. Classic data flow problems: the reach of a definition of, or assignment to, a variable (reaching definitions); knowing if a variable will be read again before being assigned a new value (live variables).
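The second classic problem, whether a variable will be read again before it is reassigned, is live-variable analysis. A minimal sketch iterates the transfer function and the join (set union over successors) until nothing changes. The dictionary-based CFG encoding is an assumption chosen for brevity.

```python
def live_variables(cfg, use, define):
    """Backward data flow analysis.
    cfg maps each node to its successors; use/define map each node to the
    variables it reads/writes. Returns (live_in, live_out) per node."""
    live_in = {n: set() for n in cfg}
    live_out = {n: set() for n in cfg}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for n in cfg:
            # Join: a variable is live after n if it is live into any successor.
            out = set().union(*(live_in[s] for s in cfg[n]))
            # Transfer: reads make a variable live, writes kill it.
            new_in = use[n] | (out - define[n])
            if out != live_out[n] or new_in != live_in[n]:
                live_out[n], live_in[n] = out, new_in
                changed = True
    return live_in, live_out


# 1: x = input()   2: y = x + 1   3: x = 0   4: print(y)
cfg = {1: [2], 2: [3], 3: [4], 4: []}
use = {1: set(), 2: {"x"}, 3: set(), 4: {"y"}}
define = {1: {"x"}, 2: {"y"}, 3: {"x"}, 4: set()}
live_in, live_out = live_variables(cfg, use, define)
print(live_out[3])  # {'y'}: x is dead after statement 3, so that store is unused
```

Both functions are monotone over finite sets, so the loop is guaranteed to terminate.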
47. Data Flow Analysis – Case Study: Implemented in Bugalyze. Example bug detection: given free(ptr), where is ptr used before it is reassigned, and is it used in a free (a double free)? Has found real bugs in Debian Linux. Still a work in progress.
48. Bugalyze – Case Study
49. Final Remarks on Program Analysis: A wide and deep field. Good to know the basic approaches. Reversing is becoming more rigorous (think HexRays).
50. Conclusion: Academia has some useful techniques. It's good to know some of the basic methods. They will improve industrial programs. Any questions?