
Talha Obaid, Email Security, Symantec at MLconf ATL 2017

A Machine Learning Approach for Detecting Malware:
This project aims to improve the way we detect script-based malware using machine learning. Malware has become one of the most active channels for delivering threats such as banking Trojans and ransomware. The talk is aimed at finding a new and effective way to detect it. We started by acquiring both malicious and clean samples. We then performed feature identification, building on the existing malware knowledge base, followed by automated feature extraction. Once a feature set was obtained, we teased out the features that are categorical, interdependent, or composite. We applied a variety of machine learning models, producing both binary and categorical outcomes. We cross-validated our results and re-tuned our feature set and our model until we obtained satisfactory results with the fewest false positives. We concluded that not all the extracted features are significant; in fact, some features are detrimental to model performance. Factoring out such features results not only in a better match but also in a significant gain in performance.

Published in: Technology

  1. Machine Learning for Detecting Malware. Talha Obaid, Ling Zhou, Timothy You, Xinlei Cai. MLConf, Atlanta, Sep 2017. Email Security. Scripting.
  2. Copyright © 2017 Symantec Corporation. SYMANTEC PROPRIETARY – Limited Use Only. The Team! Ling Zhou, Timothy You, Xinlei Cai, Talha Obaid
  3. Machine Learning @ Symantec
     • Early adopter of ML in industry
     • SRL – Symantec Research Labs
     • CAML – Centre for Advanced Machine Learning
     • Malware detection, spam identification
     • Helped achieve the compounded impact
     • Malware polymorphism
     https://www.symantec.com/connect/blogs/meet-symantec-labs-industrys-best-kept-secret
  4. How did I get infected? Reference: https://www.symantec.com/connect/blogs/machine-learning-not-only-answer
  5. Email – as a carrier!
  6. Email is the weapon of choice!
     • One in 131 emails contained a malicious link or attachment, the highest rate in five years
     • The rate jumped from 1 in 220 emails in 2015 to 1 in 131 emails in 2016
     • In 2016, small to medium-sized businesses were the most impacted by phishing attacks, with 1 in 95 emails containing malware
     • Email sent daily in 2016 – 269 billion*
     • The general office worker receives an average of 600 emails per week*
     • Blended attacks - email as a carrier for malicious URLs
     • Office document files are an effective weapon
     • Lighter footprint and hiding in plain sight
     Reference: https://www.symantec.com/security-center/threat-report
     * Email Statistics Report, 2017-2021, Radicati Group, February 2017
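
As a rough worked example of the figures on this slide, the sketch below converts the "1 in N" rates into a relative increase and estimates how many malicious emails a worker receiving 600 emails per week might see. Applying the overall 1-in-131 rate to a single inbox is an assumption made here for illustration; the slide itself does not do this calculation.

```python
# Figures quoted on the slide.
RATE_2015 = 1 / 220      # share of emails that were malicious in 2015
RATE_2016 = 1 / 131      # share of emails that were malicious in 2016
EMAILS_PER_WEEK = 600    # average office worker, per the Radicati figure

# Relative growth of the malicious-email rate from 2015 to 2016.
relative_increase = RATE_2016 / RATE_2015 - 1

# Assumption: the global 1-in-131 rate applies uniformly to one worker's inbox.
expected_malicious_per_week = EMAILS_PER_WEEK * RATE_2016

print(f"2015 rate: {RATE_2015:.2%}, 2016 rate: {RATE_2016:.2%}")
print(f"Relative increase: {relative_increase:.0%}")
print(f"Expected malicious emails per week: {expected_malicious_per_week:.1f}")
```

Under that assumption the rate grew by roughly two thirds in one year, and a typical inbox would see between four and five malicious emails per week.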
  7. Worldwide Email Forecast
     Worldwide Email User Forecast (M), 2017–2021
     | | 2017 | 2018 | 2019 | 2020 | 2021 |
     |---|---|---|---|---|---|
     | Worldwide Email Users* (M) | 3,718 | 3,823 | 3,930 | 4,037 | 4,147 |
     | % Growth | | 3% | 3% | 3% | 3% |
     * Includes both Business and Consumer Email users
     Worldwide Daily Email Traffic (B), 2017-2021
     | Daily Email Traffic | 2017 | 2018 | 2019 | 2020 | 2021 |
     |---|---|---|---|---|---|
     | Total Worldwide Emails Sent/Received Per Day (B) | 269.0 | 281.1 | 293.6 | 306.4 | 319.6 |
     | % Growth | | 4.5% | 4.4% | 4.4% | 4.3% |
     Reference: https://www.radicati.com/wp/wp-content/uploads/2017/01/Email-Statistics-Report-2017-2021-Executive-Summary.pdf
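
The growth rows in the forecast can be checked with a few lines of arithmetic. This is only a sanity-check sketch over the numbers quoted on the slide; the helper name `yoy_growth` is made up for the example.

```python
def yoy_growth(values):
    """Return year-over-year growth in percent for consecutive values."""
    return [(later / earlier - 1) * 100 for earlier, later in zip(values, values[1:])]

# Values for 2017 through 2021, as quoted on the slide.
users_millions = [3718, 3823, 3930, 4037, 4147]
traffic_billions = [269.0, 281.1, 293.6, 306.4, 319.6]

user_growth = [round(g) for g in yoy_growth(users_millions)]
traffic_growth = [round(g, 1) for g in yoy_growth(traffic_billions)]

print(user_growth)     # [3, 3, 3, 3]
print(traffic_growth)  # [4.5, 4.4, 4.4, 4.3]
```

Both results agree with the "% Growth" figures reported for 2018 to 2021.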
  8. Email: Locky malware delivery vector
     • Released in 2016
     • Still active in 2017
     • “Enable macro if data encoding is incorrect”
     • If the user does enable macros, the macros then save and run a binary file that downloads the actual encryption Trojan
     • Hospital in Hollywood paid $17,000 in bitcoin to hackers
     Reference: https://www.symantec.com/security-center/threat-report
     http://www.latimes.com/business/technology/la-me-ln-hollywood-hospital-bitcoin-20160217-story.html
     https://arstechnica.com/information-technology/2016/02/locky-crypto-ransomware-rides-in-on-malicious-word-document-macro/
  9. Scripting Malware – real ones!
  10. Exempli Gratia: AutoClose, Random variable, String split
  11. Fake variable, Fake comment, Fake condition
  12. Multiple Function, String split
  13. String encryption, Random variable, Function Call hidden
  14. String Encryption, Random variable, Multi function, Click event
  15. String hidden, Fake condition
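
One of the tricks labelled on these slides, string splitting, can be undone statically by merging adjacent string literals. The following is a minimal sketch, not the authors' implementation: the function name, the regular expression, and the sample line are all made up for illustration, and the pattern assumes VBA literals without embedded doubled quotes.

```python
import re

# Two adjacent string literals joined by & or +.
# Assumption: the literals contain no escaped quotes ("").
_SPLIT_LITERALS = re.compile(r'"([^"]*)"\s*[&+]\s*"([^"]*)"')

def join_split_strings(line: str) -> str:
    """Repeatedly merge concatenated literals until nothing changes."""
    previous = None
    while previous != line:
        previous = line
        line = _SPLIT_LITERALS.sub(
            lambda m: '"' + m.group(1) + m.group(2) + '"', line
        )
    return line

# Hypothetical macro line, not taken from the slides.
sample = 'Set obj = CreateObject("WScr" & "ipt.S" & "hell")'
print(join_split_strings(sample))
# Set obj = CreateObject("WScript.Shell")
```

Folding the pieces back together exposes the real object name to simple keyword rules or to feature counters that would otherwise miss it.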
  16. Machine Learning for hand-written text!
  17. Domain Differences
     Programming Language
     • Non-ambiguous
     • Deterministic language
     • Clear distinction between syntax and semantics
     • Semicolons, tabs vs spaces, editor wars
     • Identifiers, subroutine calls, imports
     • Comments, conventions, notations
     • Design patterns
     Natural Language
     • Ambiguous
     • Context-bound languages
     • Less distinction between syntax and semantics
     • Puns, rants, parodies, imitations
     • TF-IDF
     • LSTM – Long short-term memory
     • Bag of words
  18. Machine Learning Applications – Code! Automatic Patch Generation by Learning Correct Code, by Fan Long and Martin Rinard. Reference: https://www.newscientist.com/article/mg23331144-500-ai-learns-to-write-its-own-code-by-stealing-from-other-programs/ http://people.csail.mit.edu/rinard/paper/popl16.pdf
  19. Human-In-The-Loop? https://www.forbes.com/sites/adrianbridgwater/2016/03/07/machine-learning-needs-a-human-in-the-loop https://blogs.technet.microsoft.com/machinelearning/2016/10/17/the-power-of-human-in-the-loop-combine-human-intelligence-with-machine-learning/
  20. How we do it
  21. Rule ^ ML: Email → Analyze → Inflation → Macro Extraction → Parsing → Feature Extraction
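
To make the feature extraction stage concrete, here is a small sketch of what counting features from extracted macro source might look like. The feature names, the regular expressions, and the sample macro are illustrative assumptions; the slides do not disclose the actual feature definitions.

```python
import re
from typing import Dict

def extract_features(macro_source: str) -> Dict[str, int]:
    """Compute a few simple numeric features from VBA macro text.

    All feature names here are hypothetical placeholders.
    """
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", macro_source)
    lines = macro_source.splitlines()
    return {
        "createobject_calls": len(re.findall(r"\bCreateObject\s*\(", macro_source, re.I)),
        "shell_mentions": len(re.findall(r"\bShell\b", macro_source, re.I)),
        "chr_calls": len(re.findall(r"\bChr\$?\s*\(", macro_source, re.I)),
        "max_identifier_length": max((len(name) for name in identifiers), default=0),
        "max_line_length": max((len(line) for line in lines), default=0),
    }

# Hypothetical macro used only to exercise the function.
sample_macro = (
    "Sub AutoClose()\n"
    '    Set o = CreateObject("WScript" & Chr$(46) & "Shell")\n'
    "    o.Run cmd\n"
    "End Sub"
)
features = extract_features(sample_macro)
print(features)
```

Each macro becomes one row of numbers, which is the shape a rule engine or a classifier downstream would consume.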
  22. Feature Selection (Total 72 Features)
     ML_1…, ML_2…, ML_3…, ML_4…, ML_5…, ML_6…, ML_7…, ML_8…, ML_9…, ML_10…, ML_11…, ML_12…, ML_13…, ML_14…*, ML_15…, ML_16…, ML_17…, ML_18…, ML_19…, ML_20…, ML_21…*, …
     Note: Features with (*) can be expanded to the count of each item, for example ML_21_1…, ML_21_2…, ML_21_3…, ML_21_4…, ML_21_5…, … and ML_14_1…, … The expanded groups shown on the slide are marked "29 features" and "21 features".
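
The note about starred features can be read as turning one list-valued feature into one count column per item. The sketch below shows that idea under stated assumptions: the vocabulary and observed items are invented, since the slide does not say what ML_14 or ML_21 actually measure.

```python
from collections import Counter
from typing import Dict, List

def expand_count_feature(prefix: str, items: List[str], vocabulary: List[str]) -> Dict[str, int]:
    """Expand one list-valued feature into a count column per vocabulary item."""
    counts = Counter(items)
    return {
        f"{prefix}_{index}": counts[item]
        for index, item in enumerate(vocabulary, start=1)
    }

# Hypothetical vocabulary and observations for one sample.
vocabulary = ["CreateObject", "Shell", "Chr"]
observed = ["Chr", "Chr", "Shell", "Chr"]

expanded = expand_count_feature("ML_14", observed, vocabulary)
print(expanded)
# {'ML_14_1': 0, 'ML_14_2': 1, 'ML_14_3': 3}
```

A fixed vocabulary keeps the column layout identical across samples, so items that never occur simply yield zeros.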
  23. Optimization
     | # | ML_1… (Composite) | ML_2… | ML_3… | ML_4… | ML_14_3… |
     |---|---|---|---|---|---|
     | 1 | 31469 | 1245 | 35 | 211 | 0 |
     | 2 | 44617 | 1264 | 14 | 171 | 0 |
     | 3 | 33247 | 1045 | 14 | 158 | 0 |
     | … | … | … | … | … | … |
     | 1234 | 18828 | 682 | 29 | 222 | 1 |
     | … | … | … | … | … | … |
     | 40000 | 1273048 | 844 | 19 | 151 | 0 |
     • Treat the ML_1… feature specially, since it depends on other features.
     • Treat features like ML_14_3… as categorical.
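
The slide does not say how the composite and categorical features were treated. One common option, shown here purely as an assumption, is to drop the composite column (it is derived from other columns) and one-hot encode the categorical one. The two rows reuse values from the table above; the function name is hypothetical.

```python
from typing import Dict, List

Row = Dict[str, int]

def prepare_rows(rows: List[Row], composite: str, categorical: str) -> List[Row]:
    """Drop the composite column and one-hot encode the categorical column.

    Assumption: dropping and one-hot encoding stand in for the unspecified
    "treatment" mentioned on the slide.
    """
    levels = sorted({row[categorical] for row in rows})
    prepared = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in (composite, categorical)}
        for level in levels:
            out[f"{categorical}={level}"] = int(row[categorical] == level)
        prepared.append(out)
    return prepared

rows = [
    {"ML_1": 31469, "ML_2": 1245, "ML_3": 35, "ML_4": 211, "ML_14_3": 0},
    {"ML_1": 18828, "ML_2": 682, "ML_3": 29, "ML_4": 222, "ML_14_3": 1},
]
prepared = prepare_rows(rows, composite="ML_1", categorical="ML_14_3")
print(prepared[0])
```

Other treatments, such as keeping the composite and removing its components instead, would fit the same wording on the slide.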
  24. Results – very recent ones!
  25. Spam run – from Aug 21 to Aug 27 { "desc": "Shell call", "artifact": " Shell "Explorer.exe " & strCommande, vbNormalFocus, " },
  26. Just this morning … 15 Sep 2017
  27. Recently captured… { "desc": "Small routine with string manipulation", "artifact": " Chinook = (AscB(Sumatran_Rhinoceros))" "artifact": " Tapir = Chinook(Mid(Sand_Lizard, Chipmunk, 1)) - Int(M..." }, { "desc": "Small routine with run & Obfuscated object concat & Obfuscated object creation arguments shell & Createobject run one-liner", "artifact": " CreateObject(Pig + "Shell").Run Module1.Ibis(Sea_Dragon, "" },
  28. { "desc": "Obfuscated object variable", "artifact": "Set miLxhuTjOMrpjvLQQNhstoiWlCkOdozYkasyizjweDRGlKRkgtkgxHZyAoLfJFFaMSFJDNiRekNpWbkbkzhjETbcA tytnDmZxruTFIhTLSCM = CreateObject(ujcYEkvJXWWtqcIKOpdaxorehRVbSNYlQPiQQao" }, { "desc": "Obfuscated object creation arguments", "artifact": "Set qvBvooYSTaFymchvnZIkLUSrhheHIwfYCSyrpgvjePoCKWbhMYoOBOJVcKO = CreateObject(kbUBGIKqbHJyTmAmPbuHSBjqouVxfwCfSfEWfcNXxXYAhCJKXcegnoejsdNMnNKeFdfnieGnOXJv cjJlkKZDSV" }, { "desc": "Long obfuscated variable assignment", "artifact": "ZGwEiLSTkOsQSFcFzZVPMMuHalgKESzgWlohddzbmveToRIxzt" },
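
Random-looking identifiers like the ones captured above can be scored with character-level Shannon entropy. This is a generic heuristic sketched for illustration, not necessarily what the detection engine does; any cut-off value would be a tuning assumption, and entropy also rises with string length, so it is best combined with a length check.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(text).values()
    )

# One artifact from the slide and one human-chosen name from an earlier slide.
obfuscated = "ZGwEiLSTkOsQSFcFzZVPMMuHalgKESzgWlohddzbmveToRIxzt"
readable = "strCommande"

print(round(shannon_entropy(obfuscated), 2))
print(round(shannon_entropy(readable), 2))
```

The obfuscated name scores noticeably higher than the readable one, which is the separation a "highly random string" feature relies on.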
  29. { "desc": "Macro with constant manipulation in function call", "artifact": "dNDfJESUPztgDlcNnWNZLIPsGgXDVndgUDYaarDOIWeCVstlSACjSVcUyLZ = CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(327 - 240) & CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(324 - 241) & CWvXJUNl…" }, { "desc": "Highly random long string found", "artifact": "mRClEXzmRGxUqDPLJHcHeEMgjtqozQbuXXYIpdNJOtykVB" }, { "desc": "Object creation variable identifier", "artifact": "qvBvooYSTaFymchvnZIkLUSrhheHIwfYCSyrpgvjePoCKWbhMYoOBOJVcKO" }, { "desc": "Random subroutine name", "artifact": "dnHLjlClNBEYNnZihnFPOighaDbyTOUim" },
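
The "constant manipulation in function call" artifact hides characters behind arithmetic such as `(327 - 240)`. The sketch below folds those constants back into text. It assumes the randomly named wrapper behaves like VBA's `Chr$`, which the slide does not confirm, and it only processes the part of the artifact that is not truncated.

```python
import re

# Calls of the form Name$(a - b), e.g. Chr$(327 - 240) or an aliased wrapper.
# Assumption: the wrapper returns the character with code (a - b).
_CONSTANT_CALL = re.compile(r"[A-Za-z_]\w*\$\(\s*(\d+)\s*-\s*(\d+)\s*\)")

def fold_constant_calls(expression: str) -> str:
    """Evaluate each Name$(a - b) call and join the resulting characters."""
    return "".join(
        chr(int(a) - int(b)) for a, b in _CONSTANT_CALL.findall(expression)
    )

# The complete calls visible in the artifact; the elided remainder is left out.
visible = (
    "CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(327 - 240) & "
    "CWvXJUNlxQcbDqNtnmQhCsifqGFBSHE$(324 - 241)"
)
print(fold_constant_calls(visible))  # WS
```

Only the first two characters can be recovered from what the slide shows; the rest of the string is cut off in the transcript.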
  30. { "desc": "Random identifier with suspicious assignments", "artifact": "ujcYEkvJXWWtqcIKOpdaxorehRVbSNYlQPiQQaoCIdBbVAdczWFVpbOGsxrmOTqKykcaurtoAaRUmQJgntcvICwoBcYTiBopmrc kXChHdQUOKtTcnKzV = Chr$(327 - 240) & Chr$(324 - 241) & Chr$(24…" }, { "desc": "Shell/SaveToFile string contains strange variable name", "artifact": "RhIzeRHLbzssvNwesaErYKfXuynMPZjWdUBgPAZZUnlhknaNjNAQERoHClFgeuvBPWPbMQPsAeXlYymHXZdCZTRMfteev" }, { "desc": "File with following name was created and run created", "artifact": "XABNAGkAYwByAG8AcwBvAGYAdAA=XABxAGIASwBWAEsAdgBsAGgAdwBpAEoAUgBLAC4AZQB4AGUA" }, And… we capture a lot more!
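
The last artifact on this slide looks like two Base64 chunks placed back to back. Assuming each chunk is Base64 over UTF-16LE text (an assumption that happens to decode cleanly here), a short sketch can recover the file name pieces. The function name is made up for the example.

```python
import base64
import re
from typing import List

def decode_artifact(artifact: str) -> List[str]:
    """Split concatenated Base64 chunks and decode each as UTF-16LE text.

    Assumption: every chunk except the last ends with '=' padding, which is
    what lets the regular expression find the chunk boundary.
    """
    chunks = re.findall(r"[A-Za-z0-9+/]+=*", artifact)
    return [base64.b64decode(chunk).decode("utf-16-le") for chunk in chunks]

artifact = (
    "XABNAGkAYwByAG8AcwBvAGYAdAA="
    "XABxAGIASwBWAEsAdgBsAGgAdwBpAEoAUgBLAC4AZQB4AGUA"
)
decoded = decode_artifact(artifact)
print(decoded)
# ['\\Microsoft', '\\qbKVKvlhwiJRK.exe']
```

The decoded pieces read as a folder name followed by a randomly named executable, consistent with the description "file with following name was created and run".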
  31. Findings & Going Forward …
     • “If an artifact is missing” means a sample is missed – not anymore
     • All features contribute to the verdict in unison
     • Obfuscation is still a challenge and will remain one
     • Identify why a variable of string type is assigned a byte array
     • Why is an assignment expression longer than, say, 200 characters?
     • Keep transitioning the inflation of malware samples from sandbox to static analysis
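
The question about overly long assignment expressions suggests a simple static check. The sketch below flags assignments whose right-hand side exceeds a threshold. The 200-character limit comes from the "say 200 characters" remark and is not a tuned value; the regular expression is a simplification that ignores VBA line continuations, and the sample source is invented.

```python
import re
from typing import List, Tuple

# Hypothetical threshold based on the slide's "say 200 characters".
MAX_ASSIGNMENT_LENGTH = 200

# Simplified pattern: optional Set, an identifier, '=', then the expression.
_ASSIGNMENT = re.compile(r"^\s*(?:Set\s+)?([A-Za-z_]\w*)\s*=\s*(.+)$", re.IGNORECASE)

def long_assignments(source: str, limit: int = MAX_ASSIGNMENT_LENGTH) -> List[Tuple[str, int]]:
    """Return (variable, expression length) for right-hand sides longer than limit."""
    flagged = []
    for line in source.splitlines():
        match = _ASSIGNMENT.match(line)
        if match:
            length = len(match.group(2).strip())
            if length > limit:
                flagged.append((match.group(1), length))
    return flagged

# Invented source: one short assignment and one built from 30 Chr$ calls.
source = "x = 1\npayload = " + " & ".join(["Chr$(65)"] * 30)
print(long_assignments(source))
# [('payload', 327)]
```

A check like this costs almost nothing at scan time, which fits the stated goal of moving work from the sandbox to static analysis.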
  32. Thank You! Talha Obaid, Ling Zhou, Timothy You, Xinlei Cai. Email Security. Join us! www.symantec.com/about/careers
