BinaryPig - Scalable Malware Analytics in Hadoop

  1. 1. BinaryPig: Scalable Binary Data Extraction in Hadoop Created By:
 Jason Trost, Telvis Calhoun, Zach Hanif
  2. 2. Bringing data science to cyber security, allowing you to sense, analyze and act in real time.
  3. 3. Agenda
 • The Problem
 • BinaryPig Architecture
 • Code and Implementation Details
 • Analysis and Results
 • Demo
 • Wrap-Up
  4. 4. Background
 • 2.5 years
 • 20M samples
 • 9.5TB of malware
  5. 5. Malware data mining is useful
 • Threat intel feeds
 • Contextual enrichment on events
 • Machine learning models
  6. 6. Pre-BinaryPig: Architecture
  7. 7. Pre-BinaryPig: Storage Issues
 • We kept running out of disk!
 • We lost samples when NFS nodes failed.
  8. 8. Pre-BinaryPig - Processing Issues
 • No data locality.
 • Node failures were catastrophic.
 • Hard to add new analysis scripts.
  9. 9. Pre-BinaryPig - Data Exploration Issues
 • How can I share my findings for greater fame and glory?
 • Create a table schema for every analysis script?
 • RDBMS failure is worse than zombie apocalypse.
  10. 10. We needed a system that...
 • Scales to our historical data
 • Recovers from failures
 • Grows through scripting
 • Supports dynamic schemas
 • Searchable via the web
  11. 11. BinaryPig: framework for processing small binary files using Apache Hadoop and Apache Pig
  12. 12. BinaryPig
 • Simple DSL
 • Pluggable analytics
 • Plays nice with existing tools
 • Enables rapid iteration
  13. 13. BinaryPig
  14. 14. BinaryPig - Storage
 • HDFS: scalable, replicated
 • Aggregate malware samples into sequence files
  15. 15. BinaryPig - Processing
 • Hadoop - robust, distributed, with data locality
 • Apache Pig - extensible, simple
  16. 16. BinaryPig - Results Exploration
 • UI - turns your grandma into a data scientist
 • Elasticsearch - schemaless, replicated, awesome
  17. 17. Yet Another Framework?
 • Malware tools didn't scale
 • Hadoop does not play well with small binary files
 • Hadoop did not integrate existing malware analysis tools
  18. 18. Code and Implementation Details
  19. 19. BinaryPig is easy to use!
  20. 20. BinaryPig Ingest Tools
 • Generate sequence file from directory containing malware samples
     ./bin/dir_to_sequencefile <malwareDir> <hdfsOutputFile>
 • Generate sequence file from archive
     ./bin/archive_to_sequencefile <archive> <hdfsOutputFile>
  21. 21. BinaryPig Loaders
 • Converts raw data to a tuple
 • Executing Loader
   o Executes a specific script/program on a file written to a logical path
   o Example: Hashing
 • Daemon Loader
   o Writes binaries to a path, and provides those paths to an already running analysis process
   o Example: Clamd
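
The Executing Loader pattern on the slide above (write the sample to a local path, run a program on it, emit a tuple) can be sketched in a few lines of Python. This is only an illustration of the idea, not BinaryPig's Java implementation; the function name `run_on_sample` and the `(key, stdout)` tuple shape are assumptions made for the example.

```python
import os
import subprocess
import tempfile


def run_on_sample(key, payload, command):
    """Write payload to a local temp file, run command on that path,
    and return a (key, stdout) tuple.

    Hypothetical helper: name and tuple layout are illustrative only.
    """
    fd, path = tempfile.mkstemp(prefix="sample-")
    try:
        # Materialise the binary so an unmodified tool can read it from disk.
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
        return (key, result.stdout)
    finally:
        # Always remove the temporary copy, even if the tool fails.
        os.remove(path)
```

Calling `run_on_sample(sample_id, data, ["strings"])` would, on a machine with `strings` installed, return the sample's key paired with the tool's text output. A fresh process is started per sample here, which is exactly the cost the Daemon Loader is meant to avoid.
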
  22. 22. Optimizations in BinaryPig
 • To leverage pre-existing tools, we had to write malware binaries to the local filesystem on the worker nodes
   o Note: local copy, not network copy
   o We optimized this to use /dev/shm/ instead
 • Quick scripts are great for rapid iteration, but...
   o Interpreter startup time can dominate execution time
   o Creating small, long running, analytic daemons provides a huge speedup for frequently used tasks
   o i.e. the clamscand model of execution
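
The /dev/shm optimization mentioned above amounts to pointing the temporary copy at a memory-backed directory. A minimal sketch, assuming a fallback to the system temp directory on hosts where /dev/shm is missing or not writable (the fallback and the name `scratch_dir` are assumptions, not something the slides describe):

```python
import os
import tempfile


def scratch_dir(preferred="/dev/shm"):
    """Return a directory for short-lived sample copies.

    Prefers the memory-backed path when it exists and is writable;
    otherwise falls back to the system temp directory (an assumption
    made for this sketch).
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return tempfile.gettempdir()


def write_scratch_copy(payload):
    """Write payload into the scratch directory and return its path."""
    fd, path = tempfile.mkstemp(prefix="sample-", dir=scratch_dir())
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    return path
```

The caller is responsible for deleting the returned path once the analysis tool has finished with it.
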
  23. 23. BinaryPig: Loader Implementations
 • Generic Script Loader
 • Generic Daemon Loader
 • ClamAV Loader
 • Yara Loader
 • Hashing Loader
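
As an illustration of what a hashing step produces per sample, here is a small Python sketch. The slides do not say which digests the Hashing Loader emits; MD5, SHA-1 and SHA-256 are assumed here because they are the digests most commonly used to identify malware samples.

```python
import hashlib


def hash_sample(payload):
    """Return common digests for one sample as a dict of hex strings.

    The choice of algorithms is an assumption for this sketch.
    """
    return {
        "md5": hashlib.md5(payload).hexdigest(),
        "sha1": hashlib.sha1(payload).hexdigest(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

A dict keyed by algorithm name maps naturally onto a schemaless store such as the Elasticsearch index mentioned earlier in the deck.
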
  24. 24. BinaryPig: Scripting
 strings.sh:
   #!/bin/bash
   strings "$@"
 strings.pig:
   define Loader com.endgame.binarypig.loaders.ExecutingTextLoader;
   data = LOAD '$INPUT' USING Loader('strings.sh');
   DUMP data;
  25. 25. BinaryPig supports non-PE32 files
 • Handles more than just malware...
   o Image analysis
   o PDF data extraction
   o APK extraction
   o Any small binary files
  26. 26. Web Interface
  27. 27. Analysis and Results
  28. 28. Malware Census
 • 20 million unique binaries
   o ~94% PE format
   o ~6% are mostly Android APKs
 • 5 hours to run historical set
  29. 29. General Findings
  30. 30. Feature Extraction
 • Our core motivation was to drastically improve the experience of validating research.
 • Packer identification
   o Overall and sectional entropy
   o Kolmogorov complexity
   o Section and resource names
   o Section flags
 • Import tables
 • Function calls
 • Resource hashes and subfeatures
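
Two of the packer-identification features above are easy to sketch. Shannon entropy over byte frequencies is the usual way to compute "entropy" for a file or section. Kolmogorov complexity is not computable, so the second function uses compressed size as a stand-in; the slides do not say how the authors estimated it, and the use of zlib here is an assumption.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_ratio(data):
    """Compressed size divided by original size.

    Used here as a rough, computable proxy for Kolmogorov complexity
    (an assumption for this sketch). Values near or above 1.0 suggest
    data that is already compressed or encrypted.
    """
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)
```

Applied per PE section rather than to the whole file, the same two functions give the "sectional" variants: packed or encrypted sections tend to show entropy close to 8 bits per byte and a compression ratio close to 1.
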
  31. 31. Feature Depth
 • PE headers are shallow
   o Easy to manipulate
   o Less resolution than reverse engineering features
   o File metadata is also low resolution
 • Headers provide excellent fast features
 • Headers are often ignored
 • Work the analysis around the feature resolution
   o Ignore tight clusters, go for wide ones
   o Triage, not true classification
  32. 32. Clustering Results
 • Triage for dynamic analysis winnowing
 • Largest cluster: 377,882 samples
   o Three malware families contained within
   o Second largest: 124,894 samples
 • Validation is tricky
   o Manual validation cannot be entirely avoided
   o Cluster meanings change with feature sets
   o Cannot just go off of AV results
  33. 33. ICO Extraction
  34. 34. Icon Features
 • Pixel based features
   o Brightness
   o Color values
   o Pixel density
 • Cryptographic and fuzzy hashing
   o Perceptual hashes
 • Edge detection
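
To make "brightness" and "perceptual hashes" concrete, the sketch below computes mean brightness and an average hash, one of the simplest perceptual hashes. The slides do not name the algorithm the authors used, so this choice is an assumption. To stay dependency-free, the input is a rectangular list of rows of grayscale values (0-255) at least `size` pixels in each dimension; decoding the ICO file is out of scope.

```python
def mean_brightness(pixels):
    """Average grayscale value over all pixels."""
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


def average_hash(pixels, size=8):
    """Average hash: shrink to size x size by block averaging, then set
    one bit per cell that is brighter than the mean of all cells."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        top, bottom = row * height // size, (row + 1) * height // size
        for col in range(size):
            left, right = col * width // size, (col + 1) * width // size
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, two visually similar icons give hashes with a small Hamming distance, which is what makes this kind of feature usable for clustering.
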
  35. 35. Icon Results
 • Icon clustering
   o Groups do not just include family lines
   o Copycat malware is shown as well
   o Clear indications of malicious intent
 • Method of infection can be extrapolated
   o Phishing
   o Obfuscated executables
   o Adware (more than we expected)
   o False positives - popular software detection
  36. 36. Lessons Learned
 • Feature Selection
   o Over 500 features in PE header alone
   o Abundance of features requires pruning
 • Interpretation and Validation of Results
   o Manual validation is an unfortunate reality
   o Care has to be taken to ensure that unsupervised learning provides meaningful results
  37. 37. DEMO
  38. 38. WRAP UP
  39. 39. So What?
 • Rapid iteration
 • Feature extraction
 • Clustering analysis for rapid malware triage
 • Enables weekly AV scans with latest signatures over previous malware.
 • Created binary classifier to improve sample collection and categorize new samples
  40. 40. Future work
 • Compatibility with Pig 0.10.* and 0.11.*
 • EC2 tutorial
 • More examples/starter scripts
   o Inclusion of some of our Mahout tasks
   o Open source process for that is moving forward
 • Better error logging and handling
   o Messages should be stored in a separate DB
 • Easier deployments
   o Analytic daemons
   o Dependency libs
   o Fabric/Salt/Puppet/Chef
  41. 41. BinaryPig is Open Source! https://github.com/endgameinc/binarypig
 Apache 2 License
  42. 42. We are hiring! http://endgame.com/careers
  43. 43. QUESTIONS
