SlideShare a Scribd company logo
John Munro / jmunro@endgame.com
              Jason Trost / jtrost@endgame.com

FlonCon 2013 | January 7–10 | Albuquerque, New Mexico
Introductions
 • John Munro (jmunro@endgame.com)
   – Network Security Researcher and Data Scientist


 • Jason Trost (jtrost@endgame.com)
   – Senior Software Engineer
   – Specializes in Hadoop/Storm/BigData
Agenda

 •   The Problem
 •   Our Approach
 •   DGA Domain Classifier
 •   String Statistics as Features
 •   Malicious Domain Classifier
 •   Demo
 •   Real-time Streaming Platform
The Problem
               txmxbo.info                              youtube.com

      yahoo.com
                                Ct0u2xj5dbe4.wvw—game465.com

  p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com


           abulqe.com                       za6.limfoklubs.com

  ns3.ohio.gov                                      bibz01.apple.com
                              docs.joomla.org
 Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
The Problem
               txmxbo.info                              youtube.com

      yahoo.com
                                Ct0u2xj5dbe4.wvw—game465.com

  p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com


           abulqe.com                       za6.limfoklubs.com

  ns3.ohio.gov                                      bibz01.apple.com
                              docs.joomla.org
 Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
The Problem

• Massive Volumes
  – Some of our partners
    deal with TBs per day
    of DNS PCAPs
• Incredible Rates
  – One partner sees
    13k requests/sec
  – Another closer to
    100k/sec
Our Approach: Machine Learning!

 • Real-time streaming classification
   – In parallel across multiple servers
 • Markov Models
   – Random Domain Generation Traffic
   – Normal Benign Traffic
 • Random Forests
   – Benign vs Malicious
 • Periodically retrained
   – In order to maintain accuracy
Data Sources
 • Benign Domains
   – Millions of popular, real domains
      • Correlated with the Alexa top 10k domains
 • Malicious Domains
   – 800k domains gathered from an internal malware
     sandbox
   – Public blacklist domains from Conficker and
     Murofet Botnets
Markov Models
Markovian DGA Classifier
 • Domain Generation Algorithm (DGA)
 • Popular Domain Model
   – Trained: 258,039 domains from Day 1 of our Benign set
   – Tested: 331,359 domains from Day 2 of our
     Benign set
   – Accuracy: 99.40 % with 1,458 Unknown
 • Randomly Generated Domain Model
   – Trained: 90,884 domains from Conficker Botnet
   – Tested: 295,306 domains from Murofet Botnet
   – Accuracy: 99.34 % with 1,923 Unknown
String Statistics as Features
Feature Usefulness
Random Forests Algorithm




                  FPO
             VIDEO TO COME
Random Forests
 • Pros:
    –   Very high accuracy
    –   Scalable across many nodes
    –   Built-in protection from over fitting
    –   Can handle very large data sets with many features
    –   Robust with respect to goodness of features
    –   Practical for real world use
    –   Does not assume a distribution
    –   Only two parameters to tune
    –   Memory efficient
 • Cons:
    – Not the quickest classifier, but plenty fast in practice
Malicious Domain Classifier


 • Performance measured by 10 – fold Cross
   Validation
 • Training Set            Cross Validation
   – 200k Benign                         0.06

                                         0.05

   – 200k Malicious
                      Out of Bag Error



                                         0.04

                                         0.03

                                         0.02

                                         0.01

                                           0
                                                10   20   30     40     50      60       70   80   90   100
                                                                      Number of Trees

                                                               K=3      K=5          K = 10
Results
                               Bad Precision                                                                   Good Precision
            0.984                                                                           0.9795
            0.983                                                                            0.979
            0.982                                                                           0.9785
            0.981                                                                            0.978
                                                                                            0.9775




                                                                                Precision
Precision




             0.98
                                                                                             0.977
            0.979
                                                                                            0.9765
            0.978                                                                            0.976
            0.977                                                                           0.9755
            0.976                                                                            0.975
            0.975                                                                           0.9745
                     10   20   30   40      50    60       70   80   90   100                        10   20    30   40      50    60       70    80   90     100
                                         Number of Trees                                                                  Number of Trees
                               Bad Accuracy                                                                    Good Accuracy
            0.9805                                                                          0.9805
              0.98                                                                            0.98
            0.9795                                                                          0.9795
             0.979                                                                           0.979
Accuracy




                                                                                Accuracy

            0.9785                                                                          0.9785
             0.978                                                                           0.978
            0.9775                                                                          0.9775
             0.977                                                                           0.977
            0.9765                                                                          0.9765
             0.976                                                                           0.976
                     10   20   30   40      50    60       70   80   90   100                        10   20    30   40      50    60       70    80   90     100
                                         Number of Trees                                                                  Number of Trees
                                                                                                                                  K=3            K=5        K = 10
Results
                                                                          Model Size
                                              400
                                              350
                                              300
                            Model Size (MB)   250
                                              200                                                                K=3
                                              150                                                                K=5
                                              100                                                                K=10
                                               50
                                                0
                                                    10   20     30   40      50    60       70   80   90   100
                                                                          Number of Trees



                                                              Classification Throughput
                             60,000

                             50,000
      Classifications/sec




                             40,000                                                                              K=3

                             30,000                                                                              K=5
                                                                                                                 K=10
                             20,000

                             10,000

                                                0
                                                    10   20     30   40      50    60       70   80   90   100
                                                                          Number of Trees
Results
Demo
Realtime Streaming Platform

 • Velocity is a platform for processing, analyzing,
   and visualizing large-scale event data in real-
   time
 • It was designed to be horizontally scalable and
   is built using Twitter’s Storm

                • It was built primarily for internal
                  use with DNS events, IDS alerts,
                  and netflow data, but it is in the
                  process of being commercialized
Velocity Pipeline
Conclusion
 • Malicious domain classification
 • DGA domain identification using Markov
   Models
 • Summary Statistics based on domain string
   work well
 • Random Forests are very successful at
   classifying domains as Benign or Malicious
 • Real-time, distributed implementation
Future Work

 • Include more features: TTL, frequency seen,
   etc.
 • Correlation of bad domains based on ASN,
   Country, Organization, etc.
 • Identify subnets that are infected based on
   high traffic to bad domains
 • Identify Content Delivery Networks
 • Self Organizing Maps and other visualizations
Questions
Contact Information
 • John Munro
 • Email: jmunro@endgame.com

 •   Jason Trost
 •   Email: jtrost@endgame.com
 •   Twitter: @jason_trost
 •   Blog: www.covert.io

More Related Content

More from Jason Trost

BSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
BSidesNYC 2016 - An Adversarial View of SaaS Malware SandboxesBSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
BSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
Jason Trost
 
Distributed Sensor Data Contextualization for Threat Intelligence Analysis
Distributed Sensor Data Contextualization for Threat Intelligence AnalysisDistributed Sensor Data Contextualization for Threat Intelligence Analysis
Distributed Sensor Data Contextualization for Threat Intelligence Analysis
Jason Trost
 
An Adversarial View of SaaS Malware Sandboxes
An Adversarial View of SaaS Malware SandboxesAn Adversarial View of SaaS Malware Sandboxes
An Adversarial View of SaaS Malware Sandboxes
Jason Trost
 
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
Jason Trost
 
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
Jason Trost
 
Modern Honey Network at Bay Area Open Source Security Hackers
Modern Honey Network at Bay Area Open Source Security HackersModern Honey Network at Bay Area Open Source Security Hackers
Modern Honey Network at Bay Area Open Source Security Hackers
Jason Trost
 
Modern Honey Network (MHN)
Modern Honey Network (MHN)Modern Honey Network (MHN)
Modern Honey Network (MHN)
Jason Trost
 
BinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in HadoopBinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in Hadoop
Jason Trost
 
Accumulo Nutch/GORA, Storm, and Pig
Accumulo Nutch/GORA, Storm, and PigAccumulo Nutch/GORA, Storm, and Pig
Accumulo Nutch/GORA, Storm, and Pig
Jason Trost
 

More from Jason Trost (9)

BSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
BSidesNYC 2016 - An Adversarial View of SaaS Malware SandboxesBSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
BSidesNYC 2016 - An Adversarial View of SaaS Malware Sandboxes
 
Distributed Sensor Data Contextualization for Threat Intelligence Analysis
Distributed Sensor Data Contextualization for Threat Intelligence AnalysisDistributed Sensor Data Contextualization for Threat Intelligence Analysis
Distributed Sensor Data Contextualization for Threat Intelligence Analysis
 
An Adversarial View of SaaS Malware Sandboxes
An Adversarial View of SaaS Malware SandboxesAn Adversarial View of SaaS Malware Sandboxes
An Adversarial View of SaaS Malware Sandboxes
 
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
Deploying, Managing, and Leveraging Honeypots in the Enterprise using Open So...
 
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
Lessons Learned from Building and Running MHN, the World's Largest Crowdsourc...
 
Modern Honey Network at Bay Area Open Source Security Hackers
Modern Honey Network at Bay Area Open Source Security HackersModern Honey Network at Bay Area Open Source Security Hackers
Modern Honey Network at Bay Area Open Source Security Hackers
 
Modern Honey Network (MHN)
Modern Honey Network (MHN)Modern Honey Network (MHN)
Modern Honey Network (MHN)
 
BinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in HadoopBinaryPig - Scalable Malware Analytics in Hadoop
BinaryPig - Scalable Malware Analytics in Hadoop
 
Accumulo Nutch/GORA, Storm, and Pig
Accumulo Nutch/GORA, Storm, and PigAccumulo Nutch/GORA, Storm, and Pig
Accumulo Nutch/GORA, Storm, and Pig
 

Recently uploaded

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 

Recently uploaded (20)

HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 

Clairvoyant Squirrel: Large Scale Malicious Domain Classification

  • 1. John Munro / jmunro@endgame.com Jason Trost / jtrost@endgame.com FlonCon 2013 | January 7–10 | Albuquerque, New Mexico
  • 2. Introductions • John Munro (jmunro@endgame.com) – Network Security Researcher and Data Scientist • Jason Trost (jtrost@endgame.com) – Senior Software Engineer – Specializes in Hadoop/Storm/BigData
  • 3. Agenda • The Problem • Our Approach • DGA Domain Classifier • String Statistics as Features • Malicious Domain Classifier • Demo • Real-time Streaming Platform
  • 4. The Problem txmxbo.info youtube.com yahoo.com Ct0u2xj5dbe4.wvw—game465.com p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com abulqe.com za6.limfoklubs.com ns3.ohio.gov bibz01.apple.com docs.joomla.org Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
  • 5. The Problem txmxbo.info youtube.com yahoo.com Ct0u2xj5dbe4.wvw—game465.com p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com abulqe.com za6.limfoklubs.com ns3.ohio.gov bibz01.apple.com docs.joomla.org Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com
  • 6. The Problem • Massive Volumes – Some of our partners deal with TBs per day of DNS PCAPs • Incredible Rates – One partner sees 13k requests/sec – Another closer to 100k/sec
  • 7. Our Approach: Machine Learning! • Real-time streaming classification – In parallel across multiple servers • Markov Models – Random Domain Generation Traffic – Normal Benign Traffic • Random Forests – Benign vs Malicious • Periodically retrained – In order to maintain accuracy
  • 8. Data Sources • Benign Domains – Millions of popular, real domains • Correlated with the Alexa top 10k domains • Malicious Domains – 800k domains gathered from an internal malware sandbox – Public blacklist domains from Conficker and Murofet Botnets
  • 10. Markovian DGA Classifier • Domain Generation Algorithm (DGA) • Popular Domain Model – Trained: 258,039 domains from Day 1 of our Benign set – Tested: 331,359 domains from Day 2 of our Benign set – Accuracy: 99.40 % with 1,458 Unknown • Randomly Generated Domain Model – Trained: 90,884 domains from Conficker Botnet – Tested: 295,306 domains from Murofet Botnet – Accuracy: 99.34 % with 1,923 Unknown
  • 13. Random Forests Algorithm FPO VIDEO TO COME
  • 14. Random Forests • Pros: – Very high accuracy – Scalable across many nodes – Built-in protection from over fitting – Can handle very large data sets with many features – Robust with respect to goodness of features – Practical for real world use – Does not assume a distribution – Only two parameters to tune – Memory efficient • Cons: – Not the quickest classifier, but plenty fast in practice
  • 15. Malicious Domain Classifier • Performance measured by 10 – fold Cross Validation • Training Set Cross Validation – 200k Benign 0.06 0.05 – 200k Malicious Out of Bag Error 0.04 0.03 0.02 0.01 0 10 20 30 40 50 60 70 80 90 100 Number of Trees K=3 K=5 K = 10
  • 16. Results Bad Precision Good Precision 0.984 0.9795 0.983 0.979 0.982 0.9785 0.981 0.978 0.9775 Precision Precision 0.98 0.977 0.979 0.9765 0.978 0.976 0.977 0.9755 0.976 0.975 0.975 0.9745 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Number of Trees Number of Trees Bad Accuracy Good Accuracy 0.9805 0.9805 0.98 0.98 0.9795 0.9795 0.979 0.979 Accuracy Accuracy 0.9785 0.9785 0.978 0.978 0.9775 0.9775 0.977 0.977 0.9765 0.9765 0.976 0.976 10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100 Number of Trees Number of Trees K=3 K=5 K = 10
  • 17. Results Model Size 400 350 300 Model Size (MB) 250 200 K=3 150 K=5 100 K=10 50 0 10 20 30 40 50 60 70 80 90 100 Number of Trees Classification Throughput 60,000 50,000 Classifications/sec 40,000 K=3 30,000 K=5 K=10 20,000 10,000 0 10 20 30 40 50 60 70 80 90 100 Number of Trees
  • 19. Demo
  • 20. Realtime Streaming Platform • Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real- time • It was designed to be horizontally scalable and is built using Twitter’s Storm • It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized
  • 22. Conclusion • Malicious domain classification • DGA domain identification using Markov Models • Summary Statistics based on domain string work well • Random Forests are very successful at classifying domains as Benign or Malicious • Real-time, distributed implementation
  • 23. Future Work • Include more features: TTL, frequency seen, etc. • Correlation of bad domains based on ASN, Country, Organization, etc. • Identify subnets that are infected based on high traffic to bad domains • Identify Content Delivery Networks • Self Organizing Maps and other visualizations
  • 25. Contact Information • John Munro • Email: jmunro@endgame.com • Jason Trost • Email: jtrost@endgame.com • Twitter: @jason_trost • Blog: www.covert.io