BinaryPig: Scalable Binary Data Extraction in Hadoop
Created By: Jason Trost, Telvis Calhoun, Zach Hanif
BinaryPig - Scalable Malware Analytics in Hadoop

  1. BinaryPig: Scalable Binary Data Extraction in Hadoop
     Created By: Jason Trost, Telvis Calhoun, Zach Hanif
  2. Bringing data science to cyber security, allowing you to sense, analyze and act in real time.
  3. Agenda
     • The Problem
     • BinaryPig Architecture
     • Code and Implementation Details
     • Analysis and Results
     • Demo
     • Wrap-Up
  4. Background
     • 2.5 years
     • 20M samples
     • 9.5TB of malware
  5. Malware data mining is useful
     • Threat intel feeds
     • Contextual enrichment on events
     • Machine learning models
  6. Pre-BinaryPig: Architecture
  7. Pre-BinaryPig: Storage Issues
     • We kept running out of disk!
     • We lost samples when NFS nodes failed.
  8. Pre-BinaryPig - Processing Issues
     • No Data Locality.
     • Node failures were catastrophic.
     • Hard to add new analysis scripts.
  9. Pre-BinaryPig - Data Exploration Issues
     • How can I share my findings for greater fame and glory?
     • Create a table schema for every analysis script?
     • RDBMS failure is worse than zombie apocalypse.
  10. We needed a system that...
     • Scales to our historical data
     • Recovers from failures
     • Grows through scripting
     • Supports dynamic schemas
     • Searchable via the web
  11. BinaryPig: Framework for processing small binary files using Apache Hadoop and Apache Pig
  12. BinaryPig
     • Simple DSL
     • Pluggable analytics
     • Plays nice with existing tools
     • Enables rapid iteration
  13. BinaryPig
  14. BinaryPig - Storage
     • HDFS, scalable, replicated
     • Aggregate malware samples into sequence files
  15. BinaryPig - Processing
     • Hadoop - robust, distributed, with data locality
     • Apache Pig - Extensible, simple
  16. BinaryPig - Results Exploration
     • UI - turns your grandma into a data scientist
     • Elasticsearch - schemaless, replicated, awesome
  17. Yet Another Framework?
     • Malware tools didn't scale
     • Hadoop does not play well with small binary files
     • Hadoop did not integrate existing malware analysis tools
  18. Code and Implementation Details
  19. BinaryPig is easy to use!
  20. BinaryPig Ingest Tools
     • Generate sequence file from directory containing malware samples
         ./bin/dir_to_sequencefile <malwareDir> <hdfsOutputFile>
     • Generate sequence file from archive
         ./bin/archive_to_sequencefile <archive> <hdfsOutputFile>
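To illustrate why aggregation helps, here is a minimal sketch of packing a directory of small files into one stream of (name, bytes) records. It uses a simplified length-prefixed layout invented for this example; it is not the Hadoop SequenceFile format and not what `dir_to_sequencefile` actually writes. The function names `write_record`, `read_records`, and `pack_directory` are hypothetical.

```python
import io
import struct
from pathlib import Path
from typing import BinaryIO, Iterator, Tuple


def write_record(out: BinaryIO, key: str, value: bytes) -> None:
    """Append one record: two big-endian lengths, then key bytes, then value bytes."""
    key_bytes = key.encode("utf-8")
    out.write(struct.pack(">II", len(key_bytes), len(value)))
    out.write(key_bytes)
    out.write(value)


def read_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = src.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        key = src.read(key_len).decode("utf-8")
        value = src.read(value_len)
        yield key, value


def pack_directory(directory: Path, out: BinaryIO) -> int:
    """Pack every regular file in `directory` into `out`; return the record count."""
    count = 0
    for path in sorted(directory.iterdir()):
        if path.is_file():
            write_record(out, path.name, path.read_bytes())
            count += 1
    return count
```

One large container file of this kind gives a distributed filesystem a few big blocks to manage instead of millions of tiny ones, which is the motivation the slides give for sequence files.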
  21. BinaryPig Loaders
     • Converts raw data to a tuple
     • Executing Loader
        o Executes a specific script/program on a file written to a logical path
        o Example: Hashing
     • Daemon Loader
        o Writes binaries to a path, and provides those paths to an already running analysis process
        o Example: Clamd
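The executing-loader idea can be sketched outside Hadoop as follows. This is an assumption-laden illustration, not BinaryPig's Java implementation: `run_on_sample` and `HASH_COMMAND` are hypothetical names, and the hashing "script" is an inline Python one-liner standing in for whatever program a real deployment would run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple

# Hypothetical analysis command: prints the SHA-256 of the file given as its last argument.
HASH_COMMAND = [
    sys.executable,
    "-c",
    "import hashlib, sys; "
    "print(hashlib.sha256(open(sys.argv[1], 'rb').read()).hexdigest())",
]


def run_on_sample(
    key: str,
    data: bytes,
    command: List[str],
    workdir: Optional[str] = None,
    timeout: float = 60.0,
) -> Tuple[str, str]:
    """Write `data` to a temporary file, run `command <path>`, return (key, stdout)."""
    with tempfile.NamedTemporaryFile(dir=workdir, delete=False) as handle:
        handle.write(data)
        sample_path = Path(handle.name)
    try:
        result = subprocess.run(
            command + [str(sample_path)],
            capture_output=True,
            timeout=timeout,
            check=False,
        )
        return key, result.stdout.decode("utf-8", errors="replace")
    finally:
        # Always remove the local copy, even if the tool fails or times out.
        sample_path.unlink()
```

Calling `run_on_sample("sample-1", b"abc", HASH_COMMAND)` returns the key paired with the tool's output, which mirrors the "raw data to a tuple" conversion described on the slide.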
  22. Optimizations in BinaryPig
     • To leverage pre-existing tools, we had to write malware binaries to the local filesystem on the worker nodes
        o Note: local copy, not network copy
        o We optimized this to use /dev/shm/ instead
     • Quick scripts are great for rapid iteration, but...
        o Interpreter startup time can dominate execution time
        o Creating small, long running, analytic daemons provides a huge speedup for frequently used tasks
        o i.e. the clamd/clamdscan model of execution
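Both optimizations can be sketched in a few lines. The code below is an illustration under stated assumptions: `pick_scratch_dir` and `serve` are hypothetical names, and the one-path-per-line, tab-separated reply protocol is made up for this example (it is not clamd's protocol or BinaryPig's daemon interface).

```python
import os
import tempfile
from typing import Callable, TextIO


def pick_scratch_dir(preferred: str = "/dev/shm") -> str:
    """Prefer a memory-backed directory for sample copies; fall back to the system temp dir."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return tempfile.gettempdir()


def serve(paths_in: TextIO, results_out: TextIO, analyze: Callable[[bytes], str]) -> None:
    """Long-running loop: read one file path per line, write 'path<TAB>result' per line.

    The process (and any expensive setup inside `analyze`) starts once and is
    reused for every sample, instead of paying interpreter startup per file.
    """
    for line in paths_in:
        path = line.strip()
        if not path:
            continue
        with open(path, "rb") as handle:
            verdict = analyze(handle.read())
        results_out.write(f"{path}\t{verdict}\n")
        results_out.flush()
```

In a real daemon, `paths_in` and `results_out` would typically be a socket or the process's standard streams; plain text streams keep the sketch easy to exercise.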
  23. BinaryPig: Loader Implementations
     • Generic Script Loader
     • Generic Daemon Loader
     • ClamAV Loader
     • Yara Loader
     • Hashing Loader
  24. BinaryPig: Scripting
     strings.sh:
         #!/bin/bash
         strings "$@"
     strings.pig:
         define Loader com.endgame.binarypig.loaders.ExecutingTextLoader;
         data = LOAD '$INPUT' USING Loader('strings.sh');
         DUMP data;
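Any program that accepts file paths as arguments and prints text could take the place of `strings.sh`. As a rough sketch, here is a Python stand-in that approximates `strings` by printing runs of at least four printable ASCII characters. It is not a faithful reimplementation of the GNU tool, and `extract_strings` and `main` are hypothetical names.

```python
import re
import sys
from typing import List

# Runs of 4 or more printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> List[str]:
    """Return every printable-ASCII run of length >= 4 found in `data`."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def main(argv: List[str]) -> None:
    """Print the strings found in each file named in `argv`, one per line."""
    for path in argv:
        with open(path, "rb") as handle:
            for text in extract_strings(handle.read()):
                print(text)

# As a standalone script, the file would end with: main(sys.argv[1:])
```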
  25. BinaryPig supports non-PE32 files
     • Handles more than just malware...
        o Image analysis
        o PDF data extraction
        o APK extraction
        o Any small binary files
  26. Web Interface
  27. Analysis and Results
  28. Malware Census
     • 20 Million unique binaries
        o ~94% PE format
        o ~6% are mostly Android APKs
     • 5 hours to run historical set
  29. General Findings
  30. Feature Extraction
     • Our core motivation was to drastically improve the experience of validating research.
     • Packer identification
        o Overall and Sectional Entropy
        o Kolmogorov Complexity
        o Section and resource names
        o Section flags
     • Import tables
     • Function Calls
     • Resource hashes and subfeatures
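Two of the packer-identification features can be sketched directly. Shannon entropy over byte frequencies is standard; the compression ratio is a commonly used, computable stand-in for Kolmogorov complexity (which cannot be computed exactly) and is an assumption here, since the slides do not say how the authors estimated it. Function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; values near 1.0 suggest
    already-compressed or encrypted content, as packed sections often are."""
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)
```

Applying `shannon_entropy` to the whole file gives the overall value, and applying it to each section's bytes gives the sectional values; splitting a PE file into sections is outside the scope of this sketch.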
  31. Feature Depth
     • PE headers are shallow
        o Easy to manipulate
        o Less resolution than reverse engineering features
        o File metadata is also low resolution
     • Headers provide excellent fast features
     • Headers are often ignored
     • Work the analysis around the feature resolution
        o Ignore tight clusters, go for wide ones
        o Triage, not true classification
  32. Clustering Results
     • Triage for dynamic analysis winnowing
     • Largest cluster: 377,882 samples
        o Three malware families contained within
        o Second largest: 124,894 samples
     • Validation is tricky
        o Manual validation cannot be entirely avoided
        o Cluster meanings change with feature sets
        o Cannot just go off of AV results
  33. ICO Extraction
  34. Icon Features
     • Pixel based features
        o Brightness
        o Color values
        o Pixel density
     • Cryptographic and fuzzy hashing
        o Perceptual hashes
     • Edge detection
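As a hedged illustration of the pixel-based and perceptual-hash features, the sketch below computes mean brightness and an "average hash" over a small grayscale grid. Average hash is just one simple perceptual hash chosen for the example; the slides do not state which algorithm the authors used. The input is assumed to be an icon already decoded and downscaled to a small grid of 0-255 values, and all function names are hypothetical.

```python
from typing import List, Sequence


def _flatten(pixels: Sequence[Sequence[int]]) -> List[int]:
    """Flatten a row-major grid of grayscale values into one list."""
    return [value for row in pixels for value in row]


def mean_brightness(pixels: Sequence[Sequence[int]]) -> float:
    """Average grayscale value of the grid (0 = black, 255 = white)."""
    flat = _flatten(pixels)
    return sum(flat) / len(flat)


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """One bit per pixel, row-major: 1 if the pixel is brighter than the grid mean."""
    flat = _flatten(pixels)
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances indicate visually similar icons."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, this kind of hash changes only slightly when an icon is recolored or recompressed, which is what makes it usable as a clustering feature.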
  35. Icon Results
     • Icon clustering
        o Groups do not just include family lines
        o Copycat malware is shown as well
        o Clear indications of malicious intent
     • Method of infection can be extrapolated
        o Phishing
        o Obfuscated executables
        o Adware (more than we expected)
        o False positives - popular software detection
  36. Lessons Learned
     • Feature Selection
        o Over 500 features in PE header alone
        o Abundance of features requires pruning
     • Interpretation and Validation of Results
        o Manual validation is an unfortunate reality
        o Care has to be taken to ensure that unsupervised learning provides meaningful results
  37. DEMO
  38. WRAP UP
  39. So What?
     • Rapid Iteration
     • Feature extraction
     • Clustering analysis for rapid malware triage
     • Enables weekly AV scans with latest signatures over previous malware.
     • Created binary classifier to improve sample collection and categorize new samples
  40. Future work
     • Compatibility with Pig 0.10.* and 0.11.*
     • EC2 tutorial
     • More examples/starter scripts
        o Inclusion of some of our Mahout tasks
        o Open source process for that is moving forward
     • Better error logging and handling
        o Messages should be stored in a separate DB
     • Easier deployments
        o Analytic daemons
        o Dependency libs
        o Fabric/Salt/Puppet/Chef
  41. BinaryPig is Open Source!
     https://github.com/endgameinc/binarypig
     Apache 2 License
  42. We are hiring!
     http://endgame.com/careers
  43. QUESTIONS