Case Study: Using Mongo and MapReduce to Analyze a Difficult Research Problem

This presentation demonstrates how NoSQL technologies were used to solve a difficult analytical problem that traditional SQL databases could not. PDF malware has been on the rise for the past few years and has become one of the most successful methods for attackers to gain unauthorized access to a network. Malware analysis of PDF documents has typically been done in isolation. Commercial entities do not share their data, so researchers must fend for themselves, and more often than not they analyze a PDF file independently of other malicious PDF files. I found this approach to be highly inefficient, but storing multiple PDF documents in a database was a problem in itself. Traditional SQL databases didn’t seem like the right fit given their enforced constraints and strictly relational models.

PDF files also contain a lot of dynamic data, which makes them a tough fit for a traditional SQL model. PDF A could contain 40 objects, whereas PDF B could contain 3,000 objects. Scaling this out becomes quite difficult and messy. When looking at this problem, I wanted to solve a number of issues at once. I needed a good way to share PDF samples, an easy way to query a corpus of documents, and the ability to efficiently get my data back out so I could display it elsewhere.
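To make the variable-shape point concrete, here is a minimal Python sketch of two parsed PDFs stored as JSON-style documents. The field names (`hash`, `size_kb`, `objects`, `metadata`) and the `summarize` helper are hypothetical, chosen only for illustration; they are not the actual PDF X-RAY schema. Only the standard library is used, so no database is required to run it.

```python
import json

# Hypothetical, simplified shapes for two parsed PDFs. The real schema
# is not shown in the talk; these fields are assumptions.
pdf_a = {
    "hash": "aaa-placeholder",
    "size_kb": 200,
    "objects": [{"id": i, "names": ["/Type"]} for i in range(40)],
    "metadata": {"author": "unknown"},
}
pdf_b = {
    "hash": "bbb-placeholder",
    "size_kb": 3072,
    "objects": [{"id": i, "names": ["/Type", "/JS"]} for i in range(3000)],
    # No "metadata" key at all: a document store does not require it.
}


def summarize(doc):
    """Summarize a document without assuming every field is present."""
    return {
        "hash": doc["hash"],
        "object_count": len(doc.get("objects", [])),
        "has_metadata": "metadata" in doc,
    }


corpus = [pdf_a, pdf_b]
summaries = [summarize(doc) for doc in corpus]
for summary in summaries:
    print(json.dumps(summary))
```

Both documents sit in the same list (standing in for a collection) even though one has 40 objects and the other 3,000, and only one carries metadata. In a fixed relational layout, the same data would typically need separate tables for objects and optional attributes.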

With all my samples in JSON format, MongoDB just made sense, as it could take in these objects and allow me to query them as a whole or independently. MongoDB also provided me with a rich toolset to answer questions that had never been posed before. By using single and multi-step map/reduce jobs, I was able to aggregate PDF characteristics and apply simple averaging to identify commonalities between malicious documents.
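The multi-step idea can be sketched in plain Python. This is an in-memory imitation of map/reduce, not MongoDB's own API, and the corpus, the `label` and `names` fields, and the `run_map_reduce` helper are all assumptions made for illustration. Step 1 counts unique names per document; step 2 folds those counts into a per-class (sum, count) pair, from which an average is derived.

```python
from collections import defaultdict

# Hypothetical corpus: each record lists the PDF names seen in one file.
corpus = [
    {"hash": "m1", "label": "malicious", "names": ["/JS", "/OpenAction", "/JS"]},
    {"hash": "m2", "label": "malicious", "names": ["/JS", "/Launch"]},
    {"hash": "g1", "label": "good", "names": ["/Font", "/Page"]},
    {"hash": "g2", "label": "good", "names": ["/Font"]},
]


def run_map_reduce(records, map_fn, reduce_fn):
    """Group every emitted (key, value) pair by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Step 1: emit 1 for each unique name in a document, keyed by (label, hash).
def map_unique_names(doc):
    for _name in set(doc["names"]):
        yield (doc["label"], doc["hash"]), 1


def reduce_sum(_key, values):
    return sum(values)


step1 = run_map_reduce(corpus, map_unique_names, reduce_sum)


# Step 2: re-key step 1's output by label, carrying (total, doc_count) pairs
# so that the reduce stays safe to apply to partial results.
def map_by_label(item):
    (label, _doc_hash), unique_count = item
    yield label, (unique_count, 1)


def reduce_pairs(_key, values):
    return (sum(v[0] for v in values), sum(v[1] for v in values))


step2 = run_map_reduce(step1.items(), map_by_label, reduce_pairs)

# Finalize: turn each (total, doc_count) pair into an average.
averages = {label: total / n for label, (total, n) in step2.items()}
print(averages)
```

Carrying a (sum, count) pair instead of averaging inside the reduce is a deliberate choice: a reduce function may be applied to partial results, and an average of averages would be wrong when group sizes differ.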

Though I have had great outcomes and successes with MongoDB, there have also been annoyances and unexplained details. These issues pinned me against a wall for days and sometimes had me wondering if I had picked the wrong model for tackling this problem. In the end, though, I was able to overcome these issues and account for them without hassle.

This talk will cover new research methods and tools, created using NoSQL technologies, that analyze PDF documents more efficiently and promote collaboration among the community. It will also serve as a step forward in detecting malicious PDFs by looking at them from a statistical standpoint. When I look back on the choice to use MongoDB, I think I made a great decision. I can’t begin to think how I would have handled processing thousands of unique named function calls with multiple attributes for each PDF, or running multi-step map/reduce jobs, against a blob in a SQL database instead of a Mongo document. MongoDB provided me with rich functionality that easily led me to success on my project.


  1. Malware, Mongo and Map Reduce. Brandon Dixon, 9b+
  2. Who I Am
     • Security Researcher
     • GWU CERT, 9b+
     • Past: Security Consultant @ G2, Inc.; SMT/Network Engineer @ Windermere
     • Focus: PDF Malware Analysis; Messaging Technologies
  3. Agenda
     • PDF Woes
     • Overview of PDF X-RAY
     • Why Mongo?
     • Challenges
     • Old Questions with New Answers
     • Conclusions
  4. The PDF Problem
     • Extremely diverse and flexible
     • Heavily used by attackers
     • Difficult to parse and identify malicious content
     • Widely distributed
     • http://www.zdnet.com/blog/security/study-6-out-of-every-10-users-run-vulnerable-adobe-reader/9014
  5. Oh, but there’s more!
     • PDF A (200KB): 11 objects; 291 names/dicts; metadata; filters for compression; embedded documents; multiple updates
     • PDF B (3MB): 300 objects; 158 names/dicts; partial metadata; no filters applied; references to outside sites; single update
  6. PDF X-RAY
     • Process: 1. Upload PDF; 2. Convert to JSON; 3. Store in Mongo; 4. Create report; 5. User is informed
     • Stats: 10 collections; ~80K documents; ~3GB of data; PyMongo; Mongo; Django
  7. Why Mongo?
     • JSON in, JSON out
     • Documents treated independently
     • No fixed layout or schema
     • MapReduce power
     • Heard it was web-scale
  8. Why Mongo? (repeat of slide 7)
  9. = { one document to rule them all. }
  10. Why Mongo? (repeat of slide 7)
  11. No Thanks
  12. Why Mongo? (repeat of slide 7)
  13. Why Mongo? (repeat of slide 7)
  14. http://www.youtube.com/watch?v=b2F-DItXtZs
  15. MapReduce Fun
     • How many unique named functions occur across all malicious documents compared to good ones?
     • Despite not being required, is metadata ever useful as an indicator when classifying a PDF?
     • What can we glean about Anti-Virus software based on stored reports (assuming they are available)?
  16. Question #1: Results (Good vs. Bad)
  17. MapReduce Fun (repeat of slide 15)
  18. Question #2: Results (Good vs. Bad)
  19. MapReduce Fun (repeat of slide 15)
  20. Question #3: Results
  21. Gripes
     • Size limits
     • MapReduce troubleshooting
     • Restoring collections into each other (no upsert)
     • Getting the entire object back on specific queries
     • Lack of triggers
  22. Kudos 10gen
     • Size limits keep getting bumped up
     • Output shaping is in testing
     • Simple aggregation through queries is in testing
     • Google group responses are quick
  23. Questions and Contact
     • Brandon Dixon
     • brandon@9bplus.com
     • www.9bplus.com
     • blog.9bplus.com
     • www.pdfxray.com
     • @9bplus
