Clairvoyant Squirrel: Large Scale Malicious Domain Classification

5,479 views
5,181 views

Published on

Clairvoyant Squirrel: Large Scale Malicious Domain Classification. Presentation from FloCon 2013

Published in: Technology
1 Comment
11 Likes
Statistics
Notes
  • Guys, good deck.
    Consider making aggregations on host, domain, site and ip, subnets. And create features based on this aggregations. Making those aggregations for different time intervals is another way to go. Then, this system will be even more real time.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total views
5,479
On SlideShare
0
From Embeds
0
Number of Embeds
275
Actions
Shares
0
Downloads
79
Comments
1
Likes
11
Embeds 0
No embeds

No notes for slide

Clairvoyant Squirrel: Large Scale Malicious Domain Classification

  1. 1. John Munro / jmunro@endgame.com Jason Trost / jtrost@endgame.comFlonCon 2013 | January 7–10 | Albuquerque, New Mexico
  2. 2. Introductions • John Munro (jmunro@endgame.com) – Network Security Researcher and Data Scientist • Jason Trost (jtrost@endgame.com) – Senior Software Engineer – Specializes in Hadoop/Storm/BigData
  3. 3. Agenda • The Problem • Our Approach • DGA Domain Classifier • String Statistics as Features • Malicious Domain Classifier • Demo • Real-time Streaming Platform
  4. 4. The Problem txmxbo.info youtube.com yahoo.com Ct0u2xj5dbe4.wvw—game465.com p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com abulqe.com za6.limfoklubs.com ns3.ohio.gov bibz01.apple.com docs.joomla.org Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
  5. 5. The Problem txmxbo.info youtube.com yahoo.com Ct0u2xj5dbe4.wvw—game465.com p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com abulqe.com za6.limfoklubs.com ns3.ohio.gov bibz01.apple.com docs.joomla.org Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
  6. 6. The Problem• Massive Volumes – Some of our partners deal with TBs per day of DNS PCAPs• Incredible Rates – One partner sees 13k requests/sec – Another closer to 100k/sec
  7. 7. Our Approach: Machine Learning! • Real-time streaming classification – In parallel across multiple servers • Markov Models – Random Domain Generation Traffic – Normal Benign Traffic • Random Forests – Benign vs Malicious • Periodically retrained – In order to maintain accuracy
  8. 8. Data Sources • Benign Domains – Millions of popular, real domains • Correlated with the Alexa top 10k domains • Malicious Domains – 800k domains gathered from an internal malware sandbox – Public blacklist domains from Conficker and Murofet Botnets
  9. 9. Markov Models
  10. 10. Markovian DGA Classifier • Domain Generation Algorithm (DGA) • Popular Domain Model – Trained: 258,039 domains from Day 1 of our Benign set – Tested: 331,359 domains from Day 2 of our Benign set – Accuracy: 99.40 % with 1,458 Unknown • Randomly Generated Domain Model – Trained: 90,884 domains from Conficker Botnet – Tested: 295,306 domains from Murofet Botnet – Accuracy: 99.34 % with 1,923 Unknown
  11. 11. String Statistics as Features
  12. 12. Feature Usefulness
  13. 13. Random Forests Algorithm FPO VIDEO TO COME
  14. 14. Random Forests • Pros: – Very high accuracy – Scalable across many nodes – Built-in protection from over fitting – Can handle very large data sets with many features – Robust with respect to goodness of features – Practical for real world use – Does not assume a distribution – Only two parameters to tune – Memory efficient • Cons: – Not the quickest classifier, but plenty fast in practice
  15. 15. Malicious Domain Classifier • Performance measured by 10 – fold Cross Validation • Training Set Cross Validation – 200k Benign 0.06 0.05 – 200k Malicious Out of Bag Error 0.04 0.03 0.02 0.01 0 10 20 30 40 50 60 70 80 90 100 Number of Trees K=3 K=5 K = 10
  16. 16. Results Bad Precision Good Precision 0.984 0.9795 0.983 0.979 0.982 0.9785 0.981 0.978 0.9775 PrecisionPrecision 0.98 0.977 0.979 0.9765 0.978 0.976 0.977 0.9755 0.976 0.975 0.975 0.9745 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Number of Trees Number of Trees Bad Accuracy Good Accuracy 0.9805 0.9805 0.98 0.98 0.9795 0.9795 0.979 0.979Accuracy Accuracy 0.9785 0.9785 0.978 0.978 0.9775 0.9775 0.977 0.977 0.9765 0.9765 0.976 0.976 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Number of Trees Number of Trees K=3 K=5 K = 10
  17. 17. Results Model Size 400 350 300 Model Size (MB) 250 200 K=3 150 K=5 100 K=10 50 0 10 20 30 40 50 60 70 80 90 100 Number of Trees Classification Throughput 60,000 50,000 Classifications/sec 40,000 K=3 30,000 K=5 K=10 20,000 10,000 0 10 20 30 40 50 60 70 80 90 100 Number of Trees
  18. 18. Results
  19. 19. Demo
  20. 20. Realtime Streaming Platform • Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real- time • It was designed to be horizontally scalable and is built using Twitter’s Storm • It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized
  21. 21. Velocity Pipeline
  22. 22. Conclusion • Malicious domain classification • DGA domain identification using Markov Models • Summary Statistics based on domain string work well • Random Forests are very successful at classifying domains as Benign or Malicious • Real-time, distributed implementation
  23. 23. Future Work • Include more features: TTL, frequency seen, etc. • Correlation of bad domains based on ASN, Country, Organization, etc. • Identify subnets that are infected based on high traffic to bad domains • Identify Content Delivery Networks • Self Organizing Maps and other visualizations
  24. 24. Questions
  25. 25. Contact Information • John Munro • Email: jmunro@endgame.com • Jason Trost • Email: jtrost@endgame.com • Twitter: @jason_trost • Blog: www.covert.io

×