From DNA Sequence Variation to .NET Bits and Bobs

Mathieu Letourneau, Andrei Saygo, Eoin Ward, Microsoft

This talk will present our research project on .Net file clustering based on their respective basic blocks and the parallel that can be made with DNA sequence variation analysis. We implemented a system that extracts the basic blocks on each file and creates clusters based on them. We also developed an IDA plugin to make use of that data and speed up our analysis of .Net files.

Andrei Saygo, Eoin Ward and Mathieu Letourneau all work as Anti-Malware Security Engineers in the AM Scan team of Microsoft’s Product Release & Security Services group in Dublin, Ireland.

Published in: Technology
  • First we have to talk a bit about how DNA is organised. As you may already know, DNA is made of 4 bases (A – adenine, T – thymine, C – cytosine and G – guanine). A series of three of these bases (a codon) specifies an amino acid.
    There must be a precise order or sequence in the DNA, because this is used to make proteins that are responsible for most of the things that happen in our bodies.
    The DNA sequence that codes for protein is known as a gene.
  • As a whole, the human genome comprises about 3 billion base pairs. I’ve said base pairs because DNA has two complementary strands. Each nucleotide can pair up only with its complement (Adenine can be paired only with Thymine and Cytosine only with Guanine).
    Now, everything is nice when things are working properly, but due to various factors mutations appear and the DNA sequence changes, it’s dynamic.
    Today we will discuss the single nucleotide polymorphism (aka SNPs). It’s the most common type of genetic variation and is represented by a change in a single nucleotide. For example Adenine changes to Cytosine.
    These kinds of changes may actually be involved in various diseases, because the sequence that codes for a particular protein has changed and so the gene function has changed. Researchers are using these SNPs as biological markers to locate genes that are associated with disease.
  • One way the scientists are searching for bio markers is by using the genome-wide association study (aka GWAS) approach. They scan hundreds of human genomes from both healthy people (aka controls) and carriers and they are looking for SNPs that may predict the presence of a disease.
    The results are then displayed in a scatter plot, called a Manhattan plot because it resembles Manhattan’s skyline (kind of).
    The peaks indicate the SNPs that are most likely associated with a specific disease. For example, in this graph about 600k SNPs were tested for more than 1,400 Japanese patients suffering from atopic dermatitis (an itchy skin disorder) and almost 8,000 controls.
  • Now we know why the DNA sequence is important and that any change in the nucleotides can modify the instructions that code for protein.
    In a way it’s similar to having a sequence of opcodes where any change can modify some instruction and the result can be a different behaviour of a program.
    Based on this similarity we built a .NET disassembler and developed a clustering algorithm that allows us to identify malicious code.
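
The analogy can be made concrete with a small sketch: reading a DNA sequence three letters at a time resembles decoding a stream of opcodes, and a SNP resembles a one-byte change in that stream. The codon table below holds only the six triplets shown on the talk's slide (a real table has 64 entries), and the function names are illustrative, not taken from the talk.

```python
# Codon table limited to the triplets shown on the slide.
CODONS = {
    "CCC": "Proline",
    "TGT": "Cysteine",
    "GGA": "Glycine",
    "GCC": "Alanine",
    "ACA": "Threonine",
    "TAG": "STOP",
}


def translate(dna):
    """Read the sequence three letters at a time until a STOP codon."""
    amino_acids = []
    for i in range(0, len(dna) - 2, 3):
        name = CODONS.get(dna[i:i + 3], "unknown")
        amino_acids.append(name)
        if name == "STOP":
            break
    return amino_acids


def snp_positions(reference, sample):
    """Positions where two equal-length sequences differ in one nucleotide."""
    return [i for i, (a, b) in enumerate(zip(reference, sample)) if a != b]
```

Calling `translate("CCCTGTGGAGCCACACCCTAG")` walks the slide's example sequence codon by codon, much like a disassembler walking a method body until `ret`.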
  • We are going to have a quick overview of the .NET file structure and see how we actually get to the code.
    From the IMAGE_DATA_DIRECTORY of a .NET file we get to the CLR_HEADER that holds a similar type of structure for the MetaData. There we can get the number of streams and immediately after we have the structure for each stream. The STREAM_HEADER structure contains the name, offset and size for each stream.
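
As a rough illustration of the stream headers, here is a minimal sketch that walks a metadata root that has already been located and read into memory. It assumes the layout from the CLI specification (a "BSJB" signature, a padded version string, a stream count, then one header per stream); finding the metadata root through the PE headers is left out, and `parse_metadata_root` is a name made up for this example.

```python
import struct


def parse_metadata_root(buf):
    """Return {stream_name: (offset, size)} from a CLI metadata root."""
    signature, _major, _minor, _reserved, version_len = struct.unpack_from("<IHHII", buf, 0)
    if signature != 0x424A5342:  # the bytes "BSJB"
        raise ValueError("not a CLI metadata root")
    pos = 16 + version_len  # version string length is already padded to 4
    _flags, stream_count = struct.unpack_from("<HH", buf, pos)
    pos += 4
    streams = {}
    for _ in range(stream_count):
        offset, size = struct.unpack_from("<II", buf, pos)
        pos += 8
        end = buf.index(b"\x00", pos)
        name = buf[pos:end].decode("ascii")
        # The name is NUL-terminated and padded to a 4-byte boundary.
        pos = (end + 1 + 3) & ~3
        streams[name] = (offset, size)
    return streams
```

The returned offsets are relative to the start of the metadata root, so the `#~` entry tells us where the metadata tables begin.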
  • Out of all streams we are interested in the metadata stream. This stream contains the Method Definition table that can get us to the code for each method.
    Each Method Definition structure in the metadata tables that follow the stream header contains a relative virtual address (RVA), which points to the first instruction of the corresponding method defined in the executable.
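
The table lookup can be sketched as follows, assuming the 64-bit valid mask and the row counts have already been read from the `#~` header. The table numbers are the ones defined by the CLI metadata specification (only a few are listed here); the helper names are hypothetical.

```python
# A few table numbers from the CLI metadata specification.
TABLE_NAMES = {
    0x00: "Module",
    0x01: "TypeRef",
    0x02: "TypeDef",
    0x04: "Field",
    0x06: "MethodDef",
}


def tables_present(valid_mask):
    """Return the table numbers whose bit is set in the 64-bit valid mask."""
    return [n for n in range(64) if (valid_mask >> n) & 1]


def row_counts(valid_mask, counts):
    """Pair each present table with its row count.

    The header stores one DWORD per present table, in table-number order.
    """
    present = tables_present(valid_mask)
    if len(present) != len(counts):
        raise ValueError("expected one row count per present table")
    return {TABLE_NAMES.get(n, hex(n)): c for n, c in zip(present, counts)}
```

With the row counts known, the MethodDef table can be located by skipping over the rows of the tables that precede it.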
  • Using the Common Intermediate Language specs we developed the disassembler. It starts at the beginning of each method and continues to disassemble the current method until it reaches the RET instruction. Afterwards we split everything into basic blocks (a sequence of instructions with one entry point and one exit point, so the instructions are executed exactly once, in order). From each instruction in a basic block we extract only the first operand (FOP) and, once we have all of them, we compute a CRC over them that is later added to a database.
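
A minimal sketch of that pipeline is shown below. It is not the authors' disassembler: it knows only a handful of one-byte CIL opcodes, it takes the whole method body as input rather than stopping at the first `ret`, and it assumes that a FOP is the opcode byte with its operand bytes dropped (which is how the FOPs column on the slide reads). CRC32 from `zlib` stands in for whichever CRC the authors used.

```python
import zlib

# opcode -> (mnemonic, operand size in bytes); a tiny subset of CIL.
OPCODES = {
    0x02: ("ldarg.0", 0), 0x03: ("ldarg.1", 0), 0x04: ("ldarg.2", 0),
    0x28: ("call", 4), 0x7D: ("stfld", 4), 0x2A: ("ret", 0),
    0x2B: ("br.s", 1), 0x2C: ("brfalse.s", 1), 0x2D: ("brtrue.s", 1),
}
SHORT_BRANCHES = {0x2B, 0x2C, 0x2D}
RET = 0x2A


def disassemble(code):
    """Decode (offset, opcode, size) tuples; unknown opcodes raise KeyError."""
    pos, out = 0, []
    while pos < len(code):
        op = code[pos]
        size = 1 + OPCODES[op][1]
        out.append((pos, op, size))
        pos += size
    return out


def basic_blocks(code):
    """Split a method body into basic blocks, each a list of opcode bytes."""
    instrs = disassemble(code)
    leaders = {0}
    for off, op, size in instrs:
        nxt = off + size
        if op in SHORT_BRANCHES:
            # Short branches carry a signed 8-bit offset from the next instruction.
            rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
            leaders.update((nxt, nxt + rel))
        elif op == RET:
            leaders.add(nxt)
    blocks, current = [], []
    for off, op, _ in instrs:
        if off in leaders and current:
            blocks.append(current)
            current = []
        current.append(op)
    if current:
        blocks.append(current)
    return blocks


def block_signature(block):
    """(CRC32, length) of the opcode bytes, operands dropped."""
    return zlib.crc32(bytes(block)), len(block)
```

Feeding it the byte sequence from the slide, `bytes.fromhex("288B00000A037D52000004" "0204288B00000A2A")`, yields a single basic block of seven opcodes, and its `(crc, length)` pair is the kind of record that would go into the database.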
  • There are two things to consider when we want to do clustering: the features to cluster on and the distance measure to use.
    In our case, in line with what Andrei just explained, we used the CRCID of each FOPS in our database.
    Basically all the FOPS are linked to their CRC hash / length in our database and therefore all have a unique CRCID.

    We build a list of CRCIDs for each file, so each file can be represented as an array like the one we see here.

    This intentionally keeps the feature representation generic, so that we could end up using any other kind of feature to test different clustering approaches without having to change the clustering engine itself.

    As for the distance measure, we used a derivative of the Jaccard index. Instead of dividing the size of the intersection by the size of the union of two sets, we divide the size of the smallest set by the size of the union of both sets.

    This gives us a similarity value between 0 and 1, which we then subtract from 1 to get a distance measure.

    You can see the basic function there with a simple implementation in C#.
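
The C# function from the slide is not part of this transcript, so here is a minimal Python sketch of the measure as described in the notes. The name `get_distance` mirrors the `getDistance` operation mentioned in the talk; `jaccard_distance` is included only for comparison.

```python
def get_distance(a, b):
    """Distance used in the talk: 1 - min(|A|, |B|) / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    if union == 0:
        return 0.0  # two empty files are treated as identical
    return 1.0 - min(len(set_a), len(set_b)) / union


def jaccard_distance(a, b):
    """Classic Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    if union == 0:
        return 0.0
    return 1.0 - len(set_a & set_b) / union
```

When one file's CRCIDs are a subset of the other's, both measures agree. They differ otherwise: two disjoint files of equal size are at distance 0.5 under the derivative but at distance 1.0 under the classic index.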

  • The speed of the getDistance operation will of course vary depending on the number of FOPS in each file, meaning the number of CRCIDs in each array.
    Based on various tests, we can assume an average speed of 0.01s per distance computation.
    A normal simplistic implementation of clustering where we want to calculate the distance between every pair of files before then using a different clustering algorithm on this NxN matrix of distances would give a complexity of n-squared.

    For example, if we have to cluster 1500 files we would get 1500 square * 0.01 = 22500 seconds .. 6.25 hours.
    That clearly doesn’t scale well and as a result…
    ..we get some very bored analysts
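
The arithmetic behind that estimate is easy to reproduce. The sketch below uses the 0.01 s figure from the talk; the `symmetric` option is an addition for illustration, reflecting that a symmetric distance only needs each unordered pair once, which halves the cost but does not change the quadratic growth.

```python
def naive_seconds(n_files, seconds_per_distance=0.01, symmetric=False):
    """Estimated time to compute the full distance matrix."""
    if symmetric:
        pairs = n_files * (n_files - 1) // 2  # each unordered pair once
    else:
        pairs = n_files ** 2  # full NxN matrix, as in the talk
    return pairs * seconds_per_distance


hours = naive_seconds(1500) / 3600  # about 6.25 hours
```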
  • Luckily, we don’t need to calculate the distance for each pair of files.

    We implemented a few mitigation techniques that help us improve the speed considerably.

    First, we load all the files in memory and order them by the number of FOPs they contain. By loading the files, I mean loading the array of CRCIDs associated with each file.

    Then, we compute the distance between two files only if their size ratio is within the threshold value. For example, if we want clusters with a maximum distance of 30% (0.30), the bigger of the two files can be at most 30% bigger than the smaller one for us to bother computing the distance between them. This is possible because the file size ratio is directly used in the distance measure, so we know in advance that a size ratio beyond the threshold will never return a distance within the threshold.

    Also, we support agglomerative clustering, meaning we can periodically add new files to the previously known clusters or create new ones if needed. This could be daily, hourly or every time a new file enters our system. To avoid having to reset the clusters each time, we use prototypes to represent each cluster and only compare the new file to the prototypes of previously created clusters. We define the prototype of a cluster as the smallest file it contains (the one with the fewest FOPs/CRCIDs). You can see it as the smallest common denominator.
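
The prototype idea can be sketched as below. This is an illustration under assumptions, not the authors' code: clusters are plain dictionaries, the new file joins the first cluster whose prototype is within the threshold, and the prototype is replaced when a smaller file joins, so that it stays the smallest member.

```python
def get_distance(a, b):
    """1 - min(|A|, |B|) / |A ∪ B|, as described in the talk."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - min(len(a), len(b)) / union


def add_file(clusters, file_id, crcids, threshold):
    """Assign a new file to an existing cluster or open a new one.

    clusters is a list of {"prototype": frozenset, "members": [file ids]}.
    """
    crcids = frozenset(crcids)
    for cluster in clusters:
        if get_distance(cluster["prototype"], crcids) <= threshold:
            cluster["members"].append(file_id)
            # Keep the smallest member as the prototype.
            if len(crcids) < len(cluster["prototype"]):
                cluster["prototype"] = crcids
            return cluster
    cluster = {"prototype": crcids, "members": [file_id]}
    clusters.append(cluster)
    return cluster
```

Each new file costs at most one distance computation per existing cluster, instead of one per previously seen file.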
  • Here’s a basic overview of the clustering algorithm:

    We first load all the Files IDs and their respective CRCIDs in a dictionary.

    We then sort that dictionary by the number of CRCIDs each file contains, smallest file first.
    This operation is very quick and will be the stepping stone of most of the time savings done later.

    We can take the first file (File1) in the dictionary and put it in a new cluster.
    - We then loop over every subsequent file until the size ratio gets greater than the threshold value, and calculate each file’s distance to the selected file.
    - If we obtain a distance within the threshold value on a valid file (File2), we put it in the same cluster as File1 and remove it from the dictionary of files to cluster.
    Once we reach the end of the possible matches for File1, we remove it from the dictionary and go back to the third step until all files have been checked.
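
The steps above can be sketched as follows. The size pre-filter uses one safe reading of the size-ratio check, which is an assumption of this example: because the union is at least as large as the bigger file, the distance can never be lower than 1 - small/large, so once that bound exceeds the threshold every later (larger) file can be skipped.

```python
def get_distance(a, b):
    """1 - min(|A|, |B|) / |A ∪ B|, as described in the talk."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - min(len(a), len(b)) / union


def cluster_files(files, threshold):
    """files: {file_id: iterable of CRCIDs}. Returns a list of clusters."""
    # Load every file and sort by number of CRCIDs, smallest first.
    pending = sorted(
        ((fid, frozenset(crcids)) for fid, crcids in files.items()),
        key=lambda item: len(item[1]),
    )
    clusters = []
    while pending:
        seed_id, seed = pending.pop(0)  # smallest remaining file opens a cluster
        cluster, remaining = [seed_id], []
        for idx, (fid, crcids) in enumerate(pending):
            # Size pre-filter: 1 - |seed| / |file| > threshold, written
            # without a division so empty files are handled.
            if len(seed) < (1.0 - threshold) * len(crcids):
                remaining.extend(pending[idx:])  # all later files are larger
                break
            if get_distance(seed, crcids) <= threshold:
                cluster.append(fid)
            else:
                remaining.append((fid, crcids))
        pending = remaining
        clusters.append(cluster)
    return clusters
```

Because the seed is always the smallest file left, it is also the smallest member of the cluster it opens, which matches the prototype definition used for incremental updates.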



  • Now here’s this algorithm explained with an animation.

    Let’s say we want to cluster these 8 files with a distance threshold of 30% (or 0.30).

    The first step is to order them by size (the number of FOPs each file contains).
  • Then we can start clustering..
  • We ran a few tests to see the impact of the threshold value on the overall speed of clustering.

    We can see here that with a “loose” threshold of 80%, the speed curve is still looking exponential but is leaning towards a more linear form.

    The same 1500 files from the first example are now taking 760 seconds to cluster.

    Then, with a more realistic threshold of 20% we can see massive improvement in terms of speed, which is now almost linear.

    The set of 1500 files is now taking 14 seconds to cluster.

    This slide demonstrates well the impact of the mitigation technique we implemented to speed things up.
  • Just to recap: clustering the set of 1500 files from the initial example originally took 22500 seconds with the simplistic n-squared approach.

    With the mitigation techniques and a threshold value of 80% it takes 760 seconds to cluster, whereas it only takes 14 seconds with a threshold of 20%.

  • If you remember, in the first part of the presentation we talked about how scientists are using the Manhattan plot to identify the SNPs or bio markers used to predict a disease. Based on that approach and using the data that we’ve collected through our clustering algorithm, we’ve devised a similar technique.
    Here is an example of how starting from a known malicious file that is part of a cluster of a few hundred files, we managed to identify only the ones that are malicious and even more, how we identified the code that is unique to the malicious files.

    First we’ve got the list of CRCs that are present in the target file, and for each one we’ve counted the number of files that contain the CRC.
    The second step was to calculate the median of those counts and remove every CRC above it, based on the assumption that the most prevalent CRCs are clean.
    For each remaining CRC, using a formula similar to a chi-squared distribution with k degrees of freedom, we’ve calculated the probability that the CRC is malicious.
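
The first two steps can be sketched as below. The scoring formula itself is not reproduced in this transcript, so the sketch stops after the median filter; the function and parameter names are made up for the example.

```python
from collections import Counter
from statistics import median


def rare_crcs(target_crcs, files_in_cluster):
    """Keep the target file's CRCs whose prevalence is at or below the median.

    files_in_cluster: {file_id: set of CRCs} for the cluster of the target.
    Returns {crc: number of files containing it}.
    """
    target = set(target_crcs)
    counts = Counter()
    for crcs in files_in_cluster.values():
        counts.update(set(crcs) & target)  # count each CRC once per file
    cutoff = median(counts[crc] for crc in target)
    # Drop everything above the median: prevalent CRCs are assumed clean.
    return {crc: counts[crc] for crc in target if counts[crc] <= cutoff}
```

The surviving CRCs are the candidates that would then be scored and plotted, Manhattan-style, to find the peaks.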
  • But before we show you the results, here is an example using the same approach on a real set of genetic data, comprised of 200k SNPs. We can easily spot a few peaks that show that they are relevant to the disease.
  • In our set of 285 files we got a similar result. A few CRCs (which are actually basic blocks) were identified as possibly malicious. In order to verify our result, we’ve queried the database to get the files that contain those CRCs. The query returned another 9 files and after the analysis we saw that they are part of the same malware family.
    We haven’t stopped here though. In the next slides we will see how we developed a plugin for IDA to help and speed up our analysis.
  • The idea behind the IDA Python plugin is to use the information generated by the clustering to assist analysis.
    The plugin does this by calculating the basic blocks and then the FOPs of these basic blocks.
    It then takes this information and polls the database for the determinations on these basic blocks.
    It uses these determinations to colour the basic block and also adds a comment that gives the breakdown of the determinations.
    Finally, it appends the FOP length and FOP CRC to the comment. These can be used to get the MD5 of the files that contain this basic block by polling the database.
    You can then use these files for comparative analysis.
  • These are the basic steps.
    First Pass -> we retrieve the basic blocks, FOPs and calculate the length and CRCs on the FOPs.
    Poll the database for the comments and for each CRC we get the determination for all the files that contain it.
    Second Pass -> recalculate the basic block and this time colour and comment each basic block.
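
The colouring decision in the second pass might look like the sketch below. Everything here is hypothetical: the verdict labels, the colour names and the comment layout are placeholders, and the actual IDA calls that set the colour and comment on a block are deliberately left out.

```python
from collections import Counter


def annotate(determinations, fop_length, fop_crc):
    """Pick a colour and build a comment for one basic block.

    determinations: verdict strings ("malicious", "clean", ...) for every
    file in the database that contains this block's CRC.
    """
    counts = Counter(determinations)
    total = sum(counts.values())
    if total == 0:
        colour = "none"  # block never seen before
    elif counts["malicious"] == total:
        colour = "red"  # only seen in malicious files
    elif counts["malicious"] == 0:
        colour = "green"  # never seen in malicious files
    else:
        colour = "yellow"  # mixed determinations
    if counts:
        breakdown = ", ".join(f"{name}: {n}" for name, n in sorted(counts.items()))
    else:
        breakdown = "no determinations"
    comment = f"{breakdown} | FOP length: {fop_length}, FOP CRC: {fop_crc:08X}"
    return colour, comment
```

Keeping this logic free of IDA calls makes it testable outside the disassembler; the plugin would only need a thin layer that applies the returned colour and comment to each block.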
  • If you have looked at the dot net Intermediate Language you will constantly see pushes and pops to the stack.
    This is the evaluation stack and it is used to manage state information in .NET, similar to how x86 uses registers.
    It is best to think of the CLR as a stack machine.
    In this vein, .NET uses the maxstack directive to reserve space on the stack for its calculations.
    The amount of space reserved is not so relevant, just that there is enough of it.
    For that reason the maxstack directive is not that relevant to the program’s functionality, so IDA Pro does not disassemble it.
    Our native C disassembler did parse it, so we had discrepancies between our IDA plugin output and our native C output.
    After some investigation we discovered that there is little value in the maxstack directives with regard to the functionality of the basic block.
    So we ignore them, and we also ignore .NET NOPs for similar reasons.

    The second major issue came with the time taken to query the database.
    To improve performance we added a table to give us a single query for each CRC.
    We believe we can improve the performance further with stored procedures and an ordering of this table, as well as improvements in the Python code itself.
  • From DNA Sequence Variation to .NET Bits and Bobs

    1. Talk outline
    2. About us: We analyse files on a daily basis to determine if they are malicious and that includes Windows 8 Apps and Windows Phone apps. For the past few years we have been involved in fields like bioinformatics, molecular biology and genetics allowing us to extrapolate some of the ideas/algorithms used in the bio field and apply them to malware classification and detection purposes. [Section bar: About us | About DNA | .NET disassembler | Clustering | IDA plugin]
    3. About DNA
       - DNA is made of four chemical building blocks called nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G).
       - A three-nucleotide series (called codon) in a DNA sequence specifies a single amino acid.
       - The DNA sequences are translated to amino acids that produce proteins.
       - Each DNA sequence that contains instructions to make a protein is known as a gene.
       Image source: moleculesoflife2010.wikispaces.com/Protein+Structure
    4. About DNA sequence variation: The human genome comprises about 3 billion base pairs of DNA. Due to various factors, mutations occur so the DNA sequence may change. Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block. They can act as biological markers, helping scientists locate genes that are associated with disease.
    5. About GWAS: A genome-wide association study (GWAS) is an approach used in genetics research to associate specific genetic variations with particular diseases. The method involves scanning the genomes (1 million SNPs) from many different people (healthy and carriers) and looking for genetic markers that can be used to predict the presence of a disease. The results of a GWAS are often displayed in a scatter plot (called a Manhattan plot), in which the peaks indicate regions of the genome associated with that disease.
       Figure: Manhattan plot showing the −log10 P values of 606,164 SNPs in the GWAS for 1,472 Japanese atopic dermatitis cases (atopic dermatitis, also known as atopic eczema, is a non-contagious itchy skin disorder) and 7,971 controls, plotted against their respective positions on autosomes and the X chromosome. Source: www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html
    6. The DNA code is read three letters at a time (these DNA triplets are called codons). Most of the codons correspond to a specific amino acid. However, some of the 64 codons code for the same amino acid. Also, three of the codons are used as 'stop' signals (STOP codon) and another is the 'start' signal (START codon). This resembles the way a disassembler works. Here the binary machine code is the DNA sequence and the assembly code is the amino acids.
       CCCTGTGGAGCCACACCCTAG -> CCC TGT GGA GCC ACA CCC TAG
       Amino acids         | CIL (MSIL) instructions
       CCC - Proline       | 288B00000A  call
       TGT - Cysteine      | 03          ldarg.1
       GGA - Glycine       | 7D52000004  stfld
       GCC - Alanine       | 02          ldarg.0
       ACA - Threonine     | 04          ldarg.2
       CCC - Proline       | 288B00000A  call
       TAG - STOP          | 2A          ret
    7. The CLR header can be reached from the IMAGE_DATA_DIRECTORY structure. Then we have access to the offset to the MetaData header that holds the number of streams. Immediately after, we have the headers for each stream contained inside the file.
       typedef struct CLR_HEADER {
           DWORD SizeOfStructure;
           WORD MajorRuntimeVersion;
           WORD MinorRuntimeVersion;
           IMAGE_DATA_DIRECTORY MetaData;
           …..
       typedef struct METADATA_HEADER {
           …
           IMAGE_DATA_DIRECTORY NoOfStreams;
           …..
       typedef struct STREAM_HEADERSR {
           DWORD Offset;
           DWORD Size;
           unsigned char * Name;
           …..
    8. We are interested in #~ (the metadata stream) because it contains the information about the methods.
       - The #~ table header contains a bitmask QWORD that tells us the tables present in this stream (for example we can have the TypeRef, TypeDef, MethodDef, Field, etc. tables). Out of all of them, we are interested in the MethodDef table because it contains the RVAs of the method bodies.
       - Following the #~ header we have a set of DWORDs specifying the number of rows for each table that is present.
       - After them we have the actual Metadata tables.
       - The RVA within the MethodDef table tells us where the body of the method can be found.
       typedef struct TABLE_HEADER {
           DWORD Reserved;
           WORD MajorVersion;
           WORD MinorVersion;
           …
           QWORD ValidMask;
           …..
       typedef struct TABLE_METHODDEF {
           DWORD RVA;
           WORD ImplFlags;
           WORD Flags;
           WORD NameIndex;
           …..
    9. For each method the RVA is the offset to the first instruction. The Common Intermediate Language (CIL), formerly MSIL, instructions are encoded using a variable-length instruction encoding, where 1 or 2 bytes are used to represent the instruction. We continue to disassemble from the first instruction until we reach RET (opcode 0x2A in CIL). All the instructions are split into basic blocks and we pick only the first operand (FOP). We have a set of rules that will filter out garbage instructions. We then do a CRC on the list of FOPs and add it to the database.
       CIL (MSIL)  | FOPs
       288B00000A  | call
       03          | ldarg.1
       7D52000004  | stfld
       02          | ldarg.0
       04          | ldarg.2
       288B00000A  | call
       2A          | ret
    10. Clustering
    11. Clustering - basics
        Feature set:
        - CRCIDs representing the hashes of each FOPS present in a given file
        - Double[ ] file1 = [1, 32, 5673, 5674, 5675, 18001, …, 18607];
        Distance measure:
        - Jaccard index: size of the intersection divided by the size of the union of two sets.
        - Derivative we use: size of the smallest of the two sets divided by the size of the union.
        - Gives a similarity value between 0 and 1; subtracting that from 1 gives us a distance measure.
    12. Assume 0.01s on average per distance computation. A simplistic implementation would give a complexity of O(n²):
        - Computing the distance for every possible pair of files
        - For example, imagine having to cluster 1500 files: 1500² * 0.01 = 22500s (6.25 hours)
        Clearly doesn’t scale well.
    13. Our mitigation techniques to improve speed:
        - Loading all the files in memory and ordering them by the number of FOPs they contain.
        - Only computing the distance when the size ratio is within the threshold value, possible due to properties of our distance computation function.
        - Use of prototypes for agglomerative clustering:
          - In each cluster, the smallest file is elected as “prototype” to represent that cluster.
          - When doing agglomerative clustering, new files are compared to the prototype of each cluster until we find a distance within the threshold, or alternatively the file is put in a new cluster.
    14. [No slide text captured]
    15-33. Clustering animation – Threshold = 30% [animation frames; values shown: 90, 35, 88, 87, 40, 92; frame 20 carries the label “above threshold!”]
    34. Clustering speed (Threshold of 80%)
        Number of files to cluster | Time taken to complete (seconds)
        312                        | 6.655
        1000                       | 81.644
        1500                       | 759.799
        4604                       | 945.557
        7380                       | 1941.852
        Clustering speed (Threshold of 20%)
        Number of files to cluster | Time taken to complete (seconds)
        840                        | 3.058
        1500                       | 14
        7380                       | 35.475
    35. Time taken to cluster the same 1500 files from the previous example is now drastically improved and follows the threshold value:
        - With the simplistic approach: 22500s
        - With mitigation techniques and threshold of 80%: 760s
        - With mitigation techniques and threshold of 20%: 14s
    36. Viewing the clustered data
    37. We need:
        - a file from the database that we know is malicious (we’ve selected Pameseg/ArchSMS)
        - a loose cluster that the file is part of (we’ve selected a cluster that had 399 files)
        Algorithm:
        - for each CRC present in the target file, we extract the number of files where that CRC is present
        - calculate the median and remove everything that’s above, based on the assumption that the most prevalent CRCs are clean (they are also found in clean files). After this step we got 285 files.
        - use the following formula to get the CRCs that are most probably malicious [formula image not captured], where:
          k – total number of CRCs
          Nfi – number of files containing a specific CRC
          p – the default p-value (0.05)
          Di – distance of the specific CRC
    38. Using the set of data from gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html (200,000 SNPs) and applying the same approach we get: [plot not captured]
    39. Applying the formula on our example dataset of 285 files (that was left after we applied the median) we got a result similar to the one from the GWAS data. We took the first two CRCs and ran a query for each one in order to see which files contain them. The result was a set of 10 files, all of which were found to be malicious and from the same family (Pameseg/ArchSMS).
    40. IDA Python Plugin
    41-45. [No slide text captured]
    46. Similar to what geneticists are doing in order to analyse genetic variants and identify their link to various diseases, we have implemented an approach that can help us to automatically identify malicious files.
    47. The IDA plugin shows the areas of the code that require more attention. This will reduce the time for manual analysis. We can extend the clustering algorithm to other features like instructions, behaviour data, etc. In the future we plan to extend the approach to other types of files and other platforms.
    48. Will this method be effective with packed files? Will this method be effective with obfuscated .NET files? Does the plugin improve analysis time? Can the CRCs be used as part of generic detections / family classification? What is the effect of the speed mitigation strategies and of using a derivative of the Jaccard index? Other questions, thoughts, etc…
