Applying Machine Learning to Network Security Monitoring - BayThreat 2013



Video (at YouTube) -

Big Data Security Analytics, Data Science and Machine Learning are a few of the new buzzwords that have invaded our industry of late. Most of what we hear are promises of a unicorn-laden, silver-bullet panacea by heavy-handed marketing folks, evoking an expected pushback from the most enlightened members of our community.

This talk will help parse what we as a community need to know and understand about these concepts: the technical details and actual capabilities behind them, where they fail, and how they can be exploited and fooled by an attacker.

The talk will also share results of the author's ongoing research (on MLSec Project) of applying machine learning techniques to information security monitoring.



  1. Applying Machine Learning to Network Security Monitoring
     Alexandre Pinto, Chief Data Scientist | MLSec Project
     @alexcpsec @MLSecProject
  2. WARNING!
     • This is a talk about BUILDING, not breaking
       – NO systems were harmed in the development of this talk.
       – This is NOT about 1337 Android Malware
     • Only thing we are likely to break here is the time limit on the talk
     • This talk includes more MATH than the daily recommended intake by the FDA.
     • All stunts described in this talk were performed by trained professionals.
  3. Who's Alex?
     • 13 years in Information Security, done a little bit of everything.
     • Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US.
       – If there is any way a SIEM can hurt you, it did to me.
     • Researching machine learning and data science in general for the past year or so and presenting about the intersection of it and InfoSec throughout the year.
     • Created MLSec Project in July 2013 to give structure to the research being done.
  4. Agenda
     • Definitions
       – Big Data
       – Data Science
       – Machine Learning
     • Y U DO DIS?
     • Network Security Monitoring
     • PoC || GTFO
     • Feature Intuition
     • How to get started?
  5. Big Data + Machine Learning + Data Science
  6. Big Data + Machine Learning + Data Science
  7. Big Data
  8. (Security) Data Scientist
     • "Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician." -- Josh Wills, Cloudera
     Data Science Venn Diagram by Drew Conway
  9. Enter Machine Learning
     • "Machine learning systems automatically learn programs from data" (*)
     • You don't really code the program; it is inferred from the data.
     • Intuition of trying to mimic the way the brain learns: that's where terms like artificial intelligence come from.
     (*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos, 2012)
  10. Kinds of Machine Learning
      • Supervised Learning:
        – Classification (NN, SVM, Naïve Bayes)
        – Regression (linear, logistic)
      • Unsupervised Learning:
        – Clustering (k-means)
        – Decomposition (PCA, SVD)
      Source: scikit-learn-tutorial/general_concepts.html
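As a toy illustration of the supervised classification branch above, here is a minimal nearest-neighbor classifier in plain Python. The feature values and labels are invented for illustration only; they are not from the talk's data.

```python
import math

# Hypothetical labeled data: (packets_per_min, distinct_ports) -> label.
train = [
    ((2.0, 1.0), "normal"),
    ((3.0, 2.0), "normal"),
    ((40.0, 30.0), "scan"),
    ((55.0, 45.0), "scan"),
]

def classify_1nn(point):
    """Label a point with the label of its single nearest labeled neighbor."""
    def dist(example):
        return math.dist(example[0], point)
    return min(train, key=dist)[1]

# A host suddenly touching many ports per minute looks like the "scan" cluster.
print(classify_1nn((50.0, 40.0)))
```

The "program" here (the decision boundary) is never written by hand; it falls out of the labeled data, which is the point of the slide.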
  11. Classification Example
  12. Regression Example
  13. Considerations on Data Gathering
      • Models will (generally) get better with more data
        – But we always have to consider bias and variance as we select our data points
        – Also adversaries: we may be force-fed "bad data", find signal in weird noise, or design bad (or exploitable) features
      • "I've got 99 problems, but data ain't one"
      (Domingos, 2012; Abu-Mostafa, Caltech, 2012)
  14. Applications of Machine Learning
      • Sales
      • Trading
      • Image and Voice Recognition
  15. Y U DO DIS?
      • Common reactions from Security Professionals:
        – "Eh, cool…" *blank stare* *walks away*
        – "Are you high, bro?"
        – "Why aren't you doing some cool research like Android Malware?"
  16. Math is HARD
  17. Security Applications of ML
      • Fraud detection systems:
        – Is what he just did consistent with past behavior?
      • Network anomaly detection (?):
        – More like bad statistical analysis
        – Did not advance a lot, IMO
      • Predicting likelihood of attack actors:
        – Create different predictive models and chain them to gain more confidence in each step.
      • SPAM filters
  18. Considerations on Data Gathering
      • Adversaries: exploiting the learning process
      • Understand the model, understand the machine, and you can circumvent it
      • Something the InfoSec community knows very well
      • Any predictive model in InfoSec will be pushed to the limit
      • Again, think back on the way SPAM engines evolved.
  19. Network Security Monitoring
  20. Correlation Rules: A Primer
      • Rules in a SIEM solution invariably are:
        – "Something" has happened "x" times;
        – "Something" has happened and another "something2" has happened, with some relationship (time, same fields, etc.) between them.
      • Configuring a SIEM = iterate on combinations until:
        – Customer or management is foole… I mean, satisfied;
        – Consulting money runs out
      • Behavioral rules (anomaly detection) help a bit with the "x"s, but still, very laborious and time consuming.
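The first rule shape above ("something" has happened "x" times) can be sketched in a few lines of Python. The event stream, field names and thresholds below are hypothetical, chosen only to show the mechanics of a sliding-window count.

```python
from collections import defaultdict

# Hypothetical event stream: (timestamp_seconds, source_ip, event_type)
events = [
    (0,  "10.0.0.5", "auth_failure"),
    (10, "10.0.0.5", "auth_failure"),
    (20, "10.0.0.5", "auth_failure"),
    (25, "10.0.0.9", "auth_failure"),
]

def threshold_rule(events, event_type, x, window):
    """Alert when event_type fires >= x times from one source within `window` seconds."""
    alerts = []
    seen = defaultdict(list)
    for ts, src, etype in events:
        if etype != event_type:
            continue
        seen[src].append(ts)
        # keep only timestamps still inside the sliding window
        seen[src] = [t for t in seen[src] if ts - t <= window]
        if len(seen[src]) >= x:
            alerts.append((ts, src))
    return alerts

print(threshold_rule(events, "auth_failure", 3, 60))
```

Every knob here (the "x", the window, the grouping field) is exactly what gets iterated on by hand during SIEM tuning, which is the laborious part the slide complains about.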
  21. Kinds of Network Security Monitoring
      • Alert-based:
        – "Traditional" log management
        – SIEM
        – Using "Threat Intelligence" (i.e. blacklists) for about a year or so
        – Lack of context
        – Low effectiveness
        – You get the results handed over to you
      • Exploration-based:
        – Network Forensics tools (2/3 years ago)
        – ElasticSearch-based LM systems
        – High effectiveness
        – Lots of people necessary
        – Lots of HIGHLY trained people
      • Big Data Security Analytics (BDSA):
        – Run exploration-based monitoring on Hadoop
        – More like Big Data Security Monitoring (BDSM)
  22. Alert-based + Exploration-based
  23. A wild army of robots appears
  24. Using robots to catch bad guys
  25. PoC || GTFO
      • We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
      • Over 6 months of data from SANS DShield (thanks, guys!)
      • After a lot of statistics-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
      • Today more like 30x on the SANS data, and finding around 80% of "badness" in participant deployments.
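The odds-likelihood arithmetic behind a "13x-18x more likely" statement can be illustrated as below. The TPR/FPR values are made up for the example and are not the project's measured numbers; they are chosen only so the ratio lands in that band.

```python
def likelihood_ratio(tpr, fpr):
    """Positive likelihood ratio: how strongly a 'bad' verdict shifts the odds."""
    return tpr / fpr

def posterior_odds(prior_odds, lr):
    """Bayes' theorem in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * lr

# Hypothetical detector performance, NOT the project's actual measurements:
lr = likelihood_ratio(tpr=0.90, fpr=0.06)  # LR+ of about 15x
print(posterior_odds(0.01, lr))            # a 1-in-100 prior odds grows ~15-fold
```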
  26. Feature Intuition: IP Proximity
      • Assumptions to aggregate the data
      • Correlation / proximity / similarity BY BEHAVIOR
      • "Bad Neighborhoods" concept:
        – Spamhaus x CyberBunker
        – Google Report (June 2013)
        – Moura 2013
      • Group by Geolocation
      • Group by Netblock (/16, /24)
      • Group by ASN
        – (thanks, Team Cymru)
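The "group by netblock" aggregation can be sketched with the standard library's `ipaddress` module. The addresses below are documentation-range examples (RFC 5737), not real attackers, and the /24 grouping is just one of the granularities the slide lists.

```python
import ipaddress
from collections import Counter

# Hypothetical firewall-block sources; count "badness" per /24 neighborhood.
blocked_ips = ["203.0.113.7", "203.0.113.99", "203.0.113.200", "198.51.100.4"]

def neighborhood_counts(ips, prefix=24):
    """Aggregate observed source IPs by their containing netblock."""
    counts = Counter()
    for ip in ips:
        # strict=False lets us derive the network an address belongs to
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts

print(neighborhood_counts(blocked_ips))
```

The same loop with `prefix=16`, a GeoIP lookup, or an IP-to-ASN mapping gives the other groupings mentioned on the slide.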
  27. [Slide image: Map of the Internet (Hilbert curve) of blocks on port 22, 2013-07-20; "You are here" marker, hot regions labeled CN, BR, TH, RU, and "MULTICAST AND FRIENDS"]
  28. Feature Intuition: Temporal Decay
      • Even bad neighborhoods renovate:
        – Attackers may change ISPs/proxies
        – Botnets may be shut down / relocate
        – A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
      • As days pass, let's forget, bit by bit, who attacked
      • Last time I saw this actor, and how often did I see them
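One common way to "forget, bit by bit" is an exponentially decaying score per actor, where each sighting's weight halves after a fixed number of days. The 7-day half-life below is an arbitrary illustration, not the project's actual parameter.

```python
import math

def decayed_score(days_since_hits, half_life=7.0):
    """Sum of per-sighting weights, each halving every `half_life` days."""
    lam = math.log(2) / half_life
    return sum(math.exp(-lam * d) for d in days_since_hits)

# An actor seen today, a week ago, and two weeks ago: weights 1, 0.5, 0.25.
print(decayed_score([0, 7, 14]))
```

This naturally encodes both bullets on the slide: how recently the actor was last seen and how often, in a single number.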
  29. MLSec Project
      • Behavior: block on port 22
      • Trial inference on 100k IP addresses per Class A subnet
      • Logarithmic scale: brightest tiles are 10 to 1000 times more likely to attack.
  30. Feature Intuition: DNS Features
      • Who resolves to this IP address?
      • Number of domains that resolve to the IP address
      • Distribution of their lifetime
      • Entropy, size, ccTLDs
      • Registrar information
      • Reverse DNS information…
      • History of DNS registration…
      • (Thanks, DNSDB!)
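The entropy feature listed above is typically Shannon entropy over the characters of a domain label: algorithmically generated names tend to score higher than dictionary words. The example domains below are purely illustrative.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy, in bits per character, of a string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# DGA-style random labels usually score higher than dictionary words.
print(shannon_entropy("google"))      # lower: repeated letters, small alphabet
print(shannon_entropy("xq3kz9vbtw"))  # higher: ten distinct characters
```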
  31. Training the Model
      • YAY! We have a bunch of numbers per IP address/domain!
      • How do you define what is malicious or not?
      • "Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity." - Anton Chuvakin, Gartner
      • Kinda easy for security tools (if you trust them)
      • Web application logs need deeper statistical analysis
        – Not a normal / standard deviation thing
  32. How do I get started on this?
      • Programming is a must (Python / R)
      • Statistical knowledge keeps you from making dumb mistakes
      • Specific machine learning courses and books:
        – Coursera (ML / Data Analysis / Data Science)
      • Practice, Practice, Practice:
        – Explore your data! (Security Onion)
        – Kaggle
        – KDD, VAST, VizSec
  33. MLSec Project
      • Sign up, send logs, receive reports generated by machine learning models!
      • Working with several companies on trying out these models in their environment with their data
      • We are hiring (KINDA)
      • Visit https:// , message @MLSecProject or just e-mail me.
  34. MLSec Project - Current Research
      • Inbound attacks on exposed services (DEFCON/BH 2013):
        – Information from inbound connections on firewalls, IPS, WAFs
        – Feature extraction and supervised learning
      • Malware Distribution and Botnets:
        – Information from outbound connections on firewalls, DNS and Web Proxy
        – Initial labeling provided by intelligence feeds and AV/anti-malware
        – Semi-supervised learning involved
      • Kill-chain Ensemble Models:
        – Increased precision by composing different behaviors
        – Web server path -> go through Firewall, then IPS, then WAF
        – Early confirmation of attack failure or success
  35. Thanks!
      • Q&A?
      • Feedback?
      Alexandre Pinto
      @alexcpsec
      @MLSecProject
      https://
      "Essentially, all models are wrong, but some are useful." - George E. P. Box