AI approach to malware similarity analysis: Mapping the malware genome with a deep neural network


In recent years, cyber defenders protecting enterprise networks have started incorporating malware code-sharing identification tools into their workflows. These tools compare new malware samples to a large database of known malware samples in order to identify samples with shared code relationships. When unknown malware binaries are found to share code "fingerprints" with malware from known adversaries, this provides a key clue as to which adversary is generating the new binaries, and so helps in developing a general mitigation strategy against that family of threats. The efficacy of code-sharing identification systems is demonstrated every day, as new families of threats are discovered and countermeasures are rapidly developed for them.

Unfortunately, these systems are hard to maintain, deploy, and adapt to evolving threats. First and foremost, they do not learn to adapt to new malware obfuscation strategies, so they continually fall out of date with adversary tradecraft and periodically require labor-intensive manual tuning of the formulae used to measure similarity between malware samples. In addition, they require an up-to-date, well-maintained database of recent threats in order to provide relevant results. Such a database is difficult to deploy, and hard and expensive for smaller organizations to maintain.

To address these issues, we developed a new malware similarity detection approach. This approach not only significantly reduces the need for manual tuning of the similarity formulae, but also allows for a significantly smaller deployment footprint and provides a significant increase in accuracy. Our family/similarity detection system is the first to use deep neural networks for code-sharing identification, automatically learning to see through adversary tradecraft and thereby staying up to date with adversary evolution. Using traditional string similarity features, our approach increased accuracy by 10 percentage points, from 65% to 75%. Using an advanced set of features that we specifically designed for malware classification, our approach has 98% accuracy. In this presentation we describe how our method works, why it is able to significantly improve upon current approaches, and how it can be easily adapted and tuned to the needs of individual attendees and their organizations.
(Source: Black Hat USA 2016, Las Vegas)
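
As a rough illustration of the kind of traditional string-similarity comparison the abstract refers to (not the presenter's actual system), the following Python sketch extracts printable strings from two byte buffers and estimates their Jaccard similarity with MinHash. The function names, the 4-character minimum string length, the salted BLAKE2b hash family, and the toy byte buffers are all assumptions made for this example.

```python
import hashlib
import re


def printable_strings(data: bytes, min_len: int = 4) -> set:
    """Return the set of printable-ASCII runs of at least min_len characters."""
    # Hypothetical extraction rule, similar in spirit to the Unix `strings` tool.
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}")
    return {match.decode("ascii") for match in pattern.findall(data)}


def _seeded_hash(seed: int, item: str) -> int:
    """One member of a family of 64-bit hash functions, selected by seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """Keep, for each hash function, the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, for comparison with the MinHash estimate."""
    return len(a & b) / len(a | b)


# Toy inputs standing in for two binaries (purely illustrative).
sample_a = b"\x00\x01GetProcAddress\x00LoadLibraryA\x00cmd.exe /c del\x00\xff"
sample_b = b"\x7f\x00GetProcAddress\x00LoadLibraryA\x00http://example.invalid\x00"

strings_a = printable_strings(sample_a)
strings_b = printable_strings(sample_b)
similarity = estimated_jaccard(minhash_signature(strings_a),
                               minhash_signature(strings_b))
```

The signature has a fixed length regardless of how many strings a sample contains, which is what makes this kind of sketch practical for searching a large database. The estimate only approximates the exact Jaccard value, and more hash functions tighten it.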


  1. An AI Approach to Malware Similarity Analysis: Mapping the Malware Genome With a Deep Neural Network • Dr. Konstantin Berlin, Senior Research Engineer, Invincea Labs
  2. Enterprise Networks are Under Constant Attack • Intelligence is critical for prevention • Cyber defenders are overwhelmed • AI can help find important relations • Figure: Number of Network Breaches Per Year (Verizon’s 2016 Data Breach Investigations Report)
  3. Intelligence through Malware Triage • Benefits: identify threat actors; link various attacks to a single actor; quickly understand functionality; speed up reverse engineering • Mitigation: signatures; network rules • Diagram label: Enterprise Network
  4. Nearest-Neighbor Classification • Stages shown: Attribute Extraction, Attribute Embedding, Similarity Search • Attributes: byte n-grams; opcode n-grams; printable strings; system calls; … • Similarity search: MinHash; feature hashing; other sketching; … • Figure: samples plotted over Feature A and Feature B, with an unknown sample marked "?" • References: Jang, Jiyong, et al. Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011. Sæbjørnsen, Andreas, et al. Proceedings of 18th international symposium on Software testing and analysis. ACM, 2009. Bayer, Ulrich, et al. NDSS. Vol. 9. 2009. …Many more
  5. What Can Go Wrong with Embedding? • Embeddings skew distances • The same embedded data can give different neighbors (e.g., Alaska and Russia)
  6. Issues with Attribute Embedding • Figure: two plots over Feature A and Feature B (Possible Attributes #1 and Possible Attributes #2), each with an unknown sample marked "?" • How to get consistent results, regardless of features?
  7. Supervised Classification (Endpoint Solution) • Malware Detection: network diagram (Layer 1, Layer 2, Layer 3, …, Layer N) with output in [0,1] • Categorical Classification: network diagram (Layer X) with family outputs (backdoor.msil.bladabindi.aa, backdoor.msil.bladabindi.aj, worm.winnt.lurka.a…) • 0.97 F1-score (precision and recall) • 1500 Microsoft Families • 2.0M Training files • Not enough for a triage system! • Reference: Joshua Saxe and Konstantin Berlin, (MALWARE). IEEE, 2015.
  8. How do we map malware into an embedding so that distances make semantic sense?
  9. Imaginary World of Malware Factories • Ideal World • Each hidden factory produces one malware family/variant • Factories are positioned relative to what and how they exploit vulnerabilities • …but this is not what we have! • Diagram labels: Secret sauce A, Secret sauce B, Idealized Embedding
  10. There is No Spoon Embedding… ("The Matrix", 1999) • We created the embedding when we selected the features • We can morph them in any way we choose • One good way to morph the features is using a deep neural network • Diagram: Deep Neural Network mapping attributes (A B C D E) to (a b c d)
  11. Morphing the Embedding • Diagram labels: Deep Neural Network (Morphing); Encoder; Decoder; Embedding; Noise; Family Labels; Secret sauce; attributes (A B C D E) mapped to (a b c d); Predicted Families (worm.win32.vobfus.hc, …) • Variational Autoencoder • Only one factory • Embedding clusters families • Reference: Kingma, D. P., & Welling, M. (2013). arXiv preprint arXiv:1312.6114.
  12. Embedding Visualization • Toy Example • 8 family/variant prediction • 2D embedding • Figure labels: Only one factory; Embedding clusters families; Collapse into a singularity; Inter-class distances not preserved
  13. Printable Strings Results • 800K samples • 1500 family/variants (99% coverage) • Time-split Validation • Train on old data • Test on 30 days later • Measure F1-score of 3-nearest neighbor classifier • Figure labels: Feature Vector; Semantic Embedding; Deep-learning Features • F1-scores shown: 0.44, 0.94, 0.96, 0.66 • Gap annotations: no gap, gap, small gap, large gap
  14. Conclusion • Developing feature extraction is expensive and requires time-consuming tuning to adapt to a specific domain • Traditional approaches to malware similarity are unsupervised and so are brittle • Using supervised-learning approaches we can improve existing features by embedding them into a more optimized space • Automatic (re)tuning will improve detection rates and reduce cost
  15. More Information • Email: • Twitter: @kberlin
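
The evaluation protocol listed on the "Printable Strings Results" slide (time-split validation, scored by the F1-score of a 3-nearest neighbor classifier) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the presenter's code: the `(first_seen_day, vector, label)` tuple layout, Euclidean distance, macro-averaged F1, the tie-breaking rule, and the toy 2D data are all hypothetical choices, since the slides do not specify them.

```python
import math
from collections import Counter


def time_split(samples, cutoff_day):
    """Train on samples first seen before cutoff_day, test on the rest."""
    train = [(vec, label) for day, vec, label in samples if day < cutoff_day]
    test = [(vec, label) for day, vec, label in samples if day >= cutoff_day]
    return train, test


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    # Assumed tie-break: among the top-voted labels, prefer the closest neighbor.
    for _, label in nearest:
        if votes[label] == top:
            return label


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores (averaging choice is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 2D embedding: (first_seen_day, vector, family label), purely illustrative.
samples = [
    (1, (0.0, 0.1), "family_a"),
    (2, (0.1, 0.0), "family_a"),
    (3, (0.2, 0.1), "family_a"),
    (4, (5.0, 5.1), "family_b"),
    (5, (5.1, 4.9), "family_b"),
    (6, (4.9, 5.0), "family_b"),
    (40, (0.1, 0.2), "family_a"),
    (41, (5.0, 4.8), "family_b"),
]

train, test = time_split(samples, cutoff_day=30)
predictions = [knn_predict(train, vec, k=3) for vec, _ in test]
score = macro_f1([label for _, label in test], predictions)
```

Splitting by first-seen date rather than at random keeps newer samples out of the training set, which is closer to how a deployed system meets previously unseen variants. The same scoring can be run on raw feature vectors and on a learned embedding to compare the two.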