- 1. A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS Silvio Cesare, Deakin University
- 2. Introduction Started off in industry (Qualys, now Volvent). Have a Masters by Research. About to receive a PhD from Deakin University. Last 5 years in post-graduate University research. Learnt some cool things along the way.
- 3. What did I do at University? Malwise v1 (Masters): malware variant detection system. Malwise v2: improved version. Simseer: binary comparison and visualization service. Simseer Cluster: binary clustering service. Simseer Search: further improved malware variant search service. Clonewise: automated detection of embedded libraries in source. Bugalyze: detection of bugs using data flow analysis.
- 4. Outline Mathematical Objects Comparing Similarity Searching Classification Clustering Program Analysis
- 5. An incomplete list of mathematical objects Strings Vectors Sets Sets of Objects Trees Graphs
- 6. Objects Different objects have different performance characteristics. Examples: Comparing two vectors is fairly fast. Exact matching of two strings is fairly fast. Inexact matching of two strings is moderately slow. Comparing two graphs is slow. (Figure: sequence alignment of two strings, which takes O(mn) time.)
- 7. Transforming one object to another Problem: Comparing two 100kb strings using the edit distance, e.g. ed(“hello”, “ggello”) = 2, is impractically slow. Solution: Transform the strings into vectors. Then use a vector comparison, which is fast. Examples: Comparing malware samples. Finding near duplicate web pages. Comparing E-Mails.
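To make the cost on slide 7 concrete, here is a minimal sketch of the edit distance (Levenshtein distance) as a textbook dynamic program. The function name `edit_distance` is illustrative and not taken from the talk. The nested loops are the O(mn) cost that makes 100kb inputs impractical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2, as in the slide's example
```

Two 100kb strings would need on the order of 10^10 inner-loop steps, which is why the slide suggests transforming the strings first.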
- 8. N-Grams Extract all N-length substrings (N-Grams) from the original string. From a training set of strings, choose the best N-Grams. Each unique N-Gram is an index in a vector. The value of the element is the number of times the N-Gram occurs. Example: the 4-Grams of W|IEH}R are W|IE, |IEH, IEH} and EH}R.
- 9. Another N-Gram example Extract N-Grams Represent new object as a ‘Set of N-Grams’ Compare sets using set similarity metrics
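The two N-Gram representations from slides 8 and 9 can be sketched as follows. The helper names and the `features` list are hypothetical. In practice the feature list would be chosen from a training set, which this sketch does not attempt.

```python
from collections import Counter


def ngrams(s: str, n: int) -> list:
    """All length-n substrings of s, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_vector(s: str, n: int, features: list) -> list:
    """Count vector: element k is how often features[k] occurs in s."""
    counts = Counter(ngrams(s, n))
    return [counts[f] for f in features]


print(ngrams("W|IEH}R", 4))        # ['W|IE', '|IEH', 'IEH}', 'EH}R']
print(set(ngrams("BSSSR", 2)))     # the 'Set of N-Grams' view
features = ["BS", "SS", "SR"]      # hypothetical chosen features
print(ngram_vector("BSSSR", 2, features))  # [1, 2, 1]
```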
- 10. A Graph problem Graph problems like approximate similarity are slow to solve. Decompose the graph into subgraphs of at most k nodes. Canonicalize the small graphs, represent each by its adjacency matrix, and transform that to a string. The graph is now a ‘Set of Strings’. Optionally represent it as a vector of ‘important k-subgraphs’. Use vector distance metrics to compare, index, and search.
- 11. K-subgraph decomposition (Figure: a control flow graph with nodes L_0 to L_7 is decomposed into k-node subgraphs, each shown with its adjacency matrix written as rows of bits.)
- 12. Graphs – Case Study Implemented in Malwise and Simseer. Take control flow graphs of programs. Decompile into strings. One: Consider program as a vector of N-Grams of decompiled strings. Two: Consider program as a set of strings. (Figure: a control flow graph with nodes L_0 to L_7 and its decompiled form: proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; })
- 13. Final Remarks on Objects Know how to represent your problem. Look into how the representation can be approximated by transforming it into another object. Vectors are often a good choice.
- 14. Comparing Problem: Measure the similarity of (or distance between) two objects. Solution: Represent objects mathematically. Use one of the many mathematical measures. Examples: Malware similarity. Near duplicate web pages.
- 15. Comparing Sets A set is a collection of elements. Given an equality function between elements, we can measure set similarity (inexact matching). Dice coefficient: $s = \frac{2|A \cap B|}{|A| + |B|}$. Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
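Both set measures from slide 15 are one-liners over Python's built-in `set`. This is a minimal sketch; returning 1.0 for two empty sets is a convention assumed here, not something the slides state.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # same assumed convention as above
    return 2 * len(a & b) / (len(a) + len(b))


a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
print(jaccard(a, b))  # 2 shared out of 6 distinct elements
print(dice(a, b))     # 0.5
```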
- 16. Comparing Vectors – Ugh, math. Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$. Manhattan Distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$. Cosine Similarity: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$.
- 17. Vector distance – a different look A vector is an n-dimensional point in space. E.g., a 2-d vector is <x,y>
- 18. Cosine similarity Line from origin to n-dimensional point. Given 2 lines, what’s the angle (theta) between them? The smaller the angle, the more similar. (Figure: lines from the origin to Point A and Point B, separated by the angle theta.)
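The three vector measures from slides 16 to 18 can be written directly from their definitions. This sketch uses plain Python sequences. The function names are illustrative, and cosine similarity is left undefined for zero vectors.

```python
import math


def euclidean(p, q) -> float:
    """Straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q) -> float:
    """Sum of absolute per-dimension differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between the lines from the origin to a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms  # raises ZeroDivisionError if either vector is all zeros


print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0, perpendicular vectors
```

Cosine similarity ignores vector length: `(1, 2)` and `(2, 4)` point the same way, so their similarity is 1 up to rounding, although their Euclidean distance is non-zero.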
- 19. Comparing Vectors – Case Study Malwise v2: Feature vector of N-Grams of decompiled flowgraphs. Manhattan Distance. Simseer Search: Same feature vector. Euclidean Distance.
- 20. Comparing Sets – Case Study Malwise v1 An element is a graph invariant of the control flow graph, represented as an integer. A program is a set of integers. Compare similarity between two programs using Dice coefficient.
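- 21. Malwise v1 - Comparing Sets (Figure: a four-node control flow graph with true/false branch edges, encoded per node as an ordered pair of outgoing edges: (1 -> 2), (1 -> 4); (2 -> 3), (); (), (); (4 -> 3), ().) Similarity is a weighted Dice coefficient: $s(A, B) = \frac{2 \sum_i w_i \times |A_i \cap B_i|}{\sum_i w_i \times |A_i| + \sum_i w_i \times |B_i|}$.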
- 22. Comparing Sets of Strings in Malwise v2 – Case Study A string is a decompiled flowgraph. A program is a set of strings. Use the edit distance between strings. Construct a 1:1 mapping between the elements of the two sets such that the sum of distances is minimized. This is the Assignment Problem from ‘combinatorial optimisation’, solved by “graph matching”.
- 23. Malwise v2 - Comparing Sets of Strings (Figure: the control flow graph of proc(){ L_0: while (v1 || v2) { L_1: if (v3) { L_2: } else { L_4: } L_5: } L_7: return; } is decompiled to the string W|IEH}R. Two programs p and q are each a set of such strings, for example BR, BW|{B}BR, BI{B}BR, BSSR, BSR and BSSSR, and strings are compared with the edit distance d = ed(p, q).)
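The 1:1 mapping described on slides 22 and 23 can be illustrated with a brute-force search over all pairings. This is only a sketch for tiny sets, not the graph-matching solver the slides refer to. A real implementation would use a polynomial-time assignment algorithm. Padding the smaller set with empty strings, so that an unmatched string costs its own length, is an assumption made here.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_cost_matching(p, q) -> int:
    """Smallest total edit distance over all 1:1 pairings (factorial time)."""
    p, q = list(p), list(q)
    size = max(len(p), len(q))
    p += [""] * (size - len(p))  # assumption: pad so both sides have equal size
    q += [""] * (size - len(q))
    return min(sum(edit_distance(s, t) for s, t in zip(p, perm))
               for perm in permutations(q))


# Pairs BR with BR (cost 0) and BSSR with BSSSR (cost 1).
print(min_cost_matching(["BR", "BSSR"], ["BSSSR", "BR"]))  # 1
```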
- 24. Final Remarks on Comparing Inexact matching is your friend. Try to use known distance metrics. They have useful properties and index better. If it’s too slow to compare, transform the object.
- 25. Similarity Searching Problem Find all ‘similar’ objects to my query in a database Example Find all words in a dictionary with at most 3 differences to my query word. This problem is known as a ‘similarity search’ Solution Naive exhaustive search. Better to use ‘Metric Trees’
- 26. Similarity Search Variations: K-nearest neighbours – the k closest objects to the query. All objects within a specific distance of the query. Constraints: Search is based on a ‘metric distance’. Metric distances satisfy mathematical properties (non-negativity, symmetry and the triangle inequality). Examples: Euclidean Distance. Jaccard Distance. Cosine Distance is not a metric.
- 27. Searching – Case Study Malwise v2: Distance metric is Manhattan Distance. Use VP-Trees to index and search in stage 1. Use DBM-Trees to index and search in stage 2. Implemented using the open source GBDI Arboretum library. (Figure: a query q with search radius r and distance d(p,q) to an object p; labels: Query Benign, Query Malicious, Query Malware.)
- 28. Final Remarks on Searching Searching for inexact matches is useful. Use good distance metrics. Use open source libraries.
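To show why a metric matters for searching, here is a minimal sketch of a vantage-point tree with a range query ("all objects within distance r"). It is not the Arboretum implementation. The class and function names are hypothetical, and the vantage point is chosen at random. The triangle inequality is what lets the search skip whole subtrees.

```python
import random


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside


def build(points, dist, rng):
    """Recursively split points by distance to a randomly chosen vantage point."""
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    dists = sorted(dist(vp, p) for p in points)
    radius = dists[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p in points if dist(vp, p) <= radius]
    outside = [p for p in points if dist(vp, p) > radius]
    return VPNode(vp, radius, build(inside, dist, rng), build(outside, dist, rng))


def range_search(node, query, r, dist, out):
    """Append every indexed point within distance r of query to out."""
    if node is None:
        return out
    d = dist(query, node.point)
    if d <= r:
        out.append(node.point)
    if d - r <= node.radius:  # the query ball can reach inside the split radius
        range_search(node.inside, query, r, dist, out)
    if d + r > node.radius:   # the query ball can reach outside the split radius
        range_search(node.outside, query, r, dist, out)
    return out


points = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 0)]
tree = build(points, manhattan, random.Random(0))
print(sorted(range_search(tree, (1, 0), 2, manhattan, [])))
```

The two pruning tests are only sound when `dist` satisfies the triangle inequality, which is why the slides stress metric distances.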
- 29. Classification The problem: Given a set of N classes and a query object, assign one of the classes to the object. Examples: Is this binary (malicious, not malicious)? Is this gmail email (primary, social, promotional)? Is this web page (defaced, not defaced)? (Figure: objects separated into Class A and Class B.)
- 30. Classification Methodology Supervised Learning: Given a training set of objects labelled by their class, build a model. Then use the model to classify unknown objects. Unsupervised Learning: No labelled data exists. “Cluster” objects into classes. Use the clusters to train a model. Then classify as normal.
- 31. Classification – What do I have to do? Represent objects using “feature vectors” A vector is an array. Each element represents a “feature”. The value of the element tends to be a count of something, or a size. Feature examples The number of times a dictionary word such as “Hello” appears in an E-Mail. The size of a binary. The number of times LoadLibraryA is executed.
- 32. Classification – WEKA? Put the feature vectors into the text-based ARFF file format. Plug into the WEKA machine learning toolkit. Experiment with different classifiers. Part of your labelled data can be used to evaluate the accuracy.
- 33. Weka ARFF file @RELATION iris @ATTRIBUTE sepallength NUMERIC @ATTRIBUTE sepalwidth NUMERIC @ATTRIBUTE petallength NUMERIC @ATTRIBUTE petalwidth NUMERIC @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica} @DATA 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 5.0,3.6,1.4,0.2,?
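Feature vectors can be written out in the ARFF layout shown on slide 33 with a few lines of code. In this sketch the function name `to_arff`, the relation name and the feature names (`file_size`, `loadlibrarya_calls`) are hypothetical, loosely following the feature examples on slide 31. A `?` marks an unknown value, as in the slide's last data row.

```python
def to_arff(relation: str, attributes: list, rows: list) -> str:
    """Render (name, type) attributes and data rows as ARFF text."""
    lines = ["@RELATION " + relation, ""]
    for name, kind in attributes:
        lines.append("@ATTRIBUTE {} {}".format(name, kind))
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("file_size", "NUMERIC"),           # hypothetical feature: size of the binary
    ("loadlibrarya_calls", "NUMERIC"),  # hypothetical feature: API call count
    ("class", "{malicious,benign}"),    # nominal class attribute
]
rows = [
    (20480, 3, "malicious"),
    (1048576, 0, "benign"),
    (4096, 7, "?"),                     # unlabelled object to classify
]
print(to_arff("binaries", attributes, rows))
```

This sketch does no quoting or escaping, so names and values containing spaces or commas would need extra handling.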
- 34. WEKA (Screenshot: the WEKA toolkit, University of Waikato; slide dated 10/25/2013.)
- 35. Classification – Case Study Clonewise Feature vector is set of features extracted from a pair of packages. Classify - do these packages share code (yes, no)? Classify – is the 1st package embedded in the 2nd package (yes, no)?
- 36. Final Remarks on Classification Many problems can be framed as classification. Learn how to use WEKA. Vectors are very good representations.
- 37. Clustering Problem: To group together “similar” objects under some notion of similarity. Easy solution: Represent objects using “feature vectors”. Plug into WEKA. (Figure: clustering of packages in Fedora Linux.)
- 38. Clustering - Case Study Simseer Cluster Represent binaries using N-Grams of decompiled flowgraphs. Use most frequent N-Grams as features. Distance measure is cosine distance.
- 39. Final Remarks on Clustering A classic machine learning problem. Again, learn to use WEKA.
- 40. Program Analysis An incredibly large and deep field. This section skims the surface. Main approaches: Theorem Proving. Model Checking. Abstract Interpretation. Data Flow Analysis.
- 41. Model Checking Looks at program states generated by a program. Some states indicate bugs. Try BLAST, a model checker for small C programs. Caveat - it’s pretty old now.
- 42. Theorem Proving - SMT SMT – what is it? An equation solver that covers the types of operations seen in machine code. Approach for Bug Detection: User input can generally be anything, so treat it as a “symbolic” variable. The rest is concrete. Simulate execution of the program, plugging all the machine code that is executed into the solver formulas. Concolic execution: Combining symbolic execution with concrete execution.
- 43. Concolic Execution At branches, can we have user input that forces us to go down each path? Use the SMT solver to tell us. Launch execution down ‘feasible’ paths. Use the solver to tell us if bugs are present. What user input, if any, can make this pointer NULL?
- 44. Concolic path-sensitive analysis (Figure: control flow graph of a compiled x86 _main, with basic blocks numbered 0 to 4: a prologue that calls ___main and zeroes a counter at -0x8(%ebp), a loop body that calls _puts and increments the counter, a loop test cmpl $0x9,-0x8(%ebp) followed by jle, and an epilogue ending in ret.)
- 45. Abstract Interpretation Abstract the execution of the program. Example: Only consider the sign of a variable, not the actual value. Requires a transfer function: what an instruction does to the abstract data. And a Join/Meet function: how data is combined when it meets from different control flow paths.
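The sign example from slide 45 can be sketched as a tiny abstract domain. The five abstract values and the function names are choices made for this illustration. `mul` and `add` play the role of transfer functions, and `join` combines values arriving from different control flow paths.

```python
NEG, ZERO, POS = "-", "0", "+"
TOP = "T"     # sign unknown: could be anything
BOTTOM = "_"  # no value yet (unreachable code)


def abstract(n: int) -> str:
    """Map a concrete integer to its sign."""
    return NEG if n < 0 else POS if n > 0 else ZERO


def join(a: str, b: str) -> str:
    """Combine abstract values that meet from different paths."""
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    return a if a == b else TOP


def mul(a: str, b: str) -> str:
    """Transfer function for multiplication."""
    if BOTTOM in (a, b):
        return BOTTOM
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def add(a: str, b: str) -> str:
    """Transfer function for addition."""
    if BOTTOM in (a, b):
        return BOTTOM
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if TOP in (a, b):
        return TOP
    return a if a == b else TOP  # (+) + (-) can have any sign


x = abstract(-3)    # x is negative
y = mul(x, x)       # x * x is positive, whatever the actual value of x
z = join(y, ZERO)   # z = y on one branch, z = 0 on the other: sign unknown
print(x, y, z)      # - + T
```

The analysis proves `y` is positive without knowing what `x` is, which is the point of abstracting execution. The join then loses precision because this domain has no "non-negative" value.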
- 46. Data Flow Analysis Similar to abstract interpretation. Uses a transfer function and a join. Implement both using a monotone framework. Data flow analysis is used by compilers. Classic data flow problems: The reach of defining or assigning to a variable (reaching definitions). Knowing if a variable will be read again before being assigned a new value (liveness).
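The second classic problem on slide 46, whether a variable will be read again before it is reassigned, is liveness analysis. Below is a minimal sketch that iterates to a fixed point over a hand-written control flow graph. The block layout and variable names are invented for illustration and are not taken from Bugalyze.

```python
def liveness(blocks: dict):
    """blocks maps a block id to (use, defs, successors).

    Transfer function: live_in = use | (live_out - defs).
    Join: live_out is the union of live_in over all successors.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # repeat until nothing changes (fixed point)
        changed = False
        for b, (use, defs, succs) in blocks.items():
            out = set().union(*(live_in[s] for s in succs))
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


# Hypothetical CFG:
#   0: ptr = malloc(...)   defines ptr
#   1: if (cond)           reads cond, branches to 2 or 3
#   2: free(ptr)           reads ptr
#   3: return
cfg = {
    0: (set(), {"ptr"}, [1]),
    1: ({"cond"}, set(), [2, 3]),
    2: ({"ptr"}, set(), [3]),
    3: (set(), set(), []),
}
live_in, live_out = liveness(cfg)
print(live_out[0])  # ptr and cond are both still needed after block 0
print(live_out[2])  # set(): nothing is read after the free
```

Because the sets only ever grow and the set of variables is finite, the loop terminates. That is the monotone framework in miniature.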
- 47. Data Flow Analysis – Case Study Implemented in Bugalyze. Example bug detection In free(ptr), where is ptr used before it is reassigned, and is it used in a free? Has found real bugs in Debian Linux. Still a work-in-progress.
- 48. Bugalyze – Case Study
- 49. Final Remarks on Program Analysis A wide and deep field. Good to know the basic approaches. Reversing is becoming more rigorous (think HexRays).
- 50. Conclusion Academia has some useful techniques. It’s good to know some of the basic methods. Will improve industrial programs. Any questions?
