
Ln monitoring repositories


Talk done at Ruxcon 2012 on monitoring git repositories



  1. Monitoring repositories for FUN and PROFIT @snyff
  2. About me ● Security consultant (CTO) working for Securus Global in Melbourne ● PentesterLab (.com): ○ cool/awesome (web) *free* training/exercises ○ real life scenarios
  3. Disclaimer ● No code is going to be released today ● No repositories were harmed during the preparation of this talk ● I worked on Web and Open Source projects ● I worked on commits without using the entire project's source code
  4. Why work on commits? ● Corporate development: ○ Cannot review all projects anymore ○ Nice to have a “what to check today” ○ Sort commits by criticality ○ Detect backdoors ● Agile development: ○ The code changes every day ○ Can't rely on one-time code review anymore ○ Current approach: daily scan
  5. Why work on commits? ● You have vulnerabilities: ○ Detect patches affecting your bugs ○ Detect changes to sensitive functions
  6. Why work on commits? ● You want vulnerabilities ($$): ○ Detect new features with dangerous functions ○ Detect changes to sensitive functions
  7. Why work on commits? ● You want bugs (lulz): ○ Get bugs a few hours before the patch is available ○ Get a list of bad practice examples ○ Detect silent patching
  8. What's a repository? ● Developers ● Files ● Commits ● And all of these are constantly moving...
  9. Developers ● Main developer(s): ○ Add features ○ Fix bugs ● Cosmetic committer(s): ○ Change comments (fix typos) ○ Change the design of the website ○ Change indentation ○ Add documentation ● External people: ○ Do a bit of everything
  10. Files ● README/LICENSE files ● Templates, HTML, CSS ● Images ● Code: ○ Libraries ○ Installation code ○ "normal" code
  11. Commits ● Developer's name ● Code changes: ○ Changes: diff ○ Files changed ○ Number of deletions/additions ● Date/Time of the commit ● Message
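The per-commit fields listed above can be pulled straight out of `git log --numstat`. A minimal Ruby sketch, not the talk's actual code: the `Commit` struct, the `|`-separated header format, and the sample log are my own illustration.

```ruby
# Parse one commit's worth of pre-captured
# `git log --numstat --pretty=format:"%an|%ad|%s"` output.
Commit = Struct.new(:author, :date, :message, :insertions, :deletions, :files)

def parse_commit(raw)
  header, *stats = raw.lines.map(&:chomp).reject(&:empty?)
  author, date, message = header.split("|", 3)
  files, ins, del = [], 0, 0
  stats.each do |line|
    added, deleted, path = line.split("\t", 3)  # numstat: added<TAB>deleted<TAB>path
    ins += added.to_i
    del += deleted.to_i
    files << path
  end
  Commit.new(author, date, message, ins, del, files)
end

sample = <<~LOG
  alice|2012-10-20|Fix XSS in login form
  3\t1\tapp/views/login.html.erb
  10\t2\tapp/controllers/sessions_controller.rb
LOG

commit = parse_commit(sample)
# commit.insertions => 13, commit.deletions => 3, commit.files.size => 2
```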
  12. Examples of projects monitored
  13. Stats (on the last 5000 commits) ● Commits per week: ○ anywhere between 20 and 180 (phpMyAdmin) per week ○ 40 commits per week seems to be the average for "normal/interesting" projects ● Authors: ○ between 1 and 140 ● Average commit: 200 lines (insertions+deletions)
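The commits-per-week figure is cheap to compute once you have the commit dates; a toy sketch (the dates are made up for illustration) that buckets by ISO week:

```ruby
require "date"

# Dates as they would come out of `git log --pretty=format:%ad --date=short`.
dates = %w[2012-09-03 2012-09-04 2012-09-05 2012-09-12 2012-09-13]

# Bucket by ISO year-week, then average the bucket sizes.
per_week = dates.map { |d| Date.parse(d).strftime("%G-%V") }.tally
avg = per_week.values.sum.fdiv(per_week.size)
# per_week => {"2012-36"=>3, "2012-37"=>2}, avg => 2.5
```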
  14. Goals...
  15. Goals: counterexample
  16. Goals: example
  17. Goals: example
  18. Goals: example
  19. Filtering...
  20. Filtering files ● General approach: ○ images ○ css ○ README ● Framework based: ○ tests (interesting to keep for some projects) ○ database migration/creation scripts ● Project based files: ○ deployment ○ installation files
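The general approach can be as simple as an extension/name blacklist applied to each commit's file list before scoring; a sketch with illustrative lists (not SANZARU's actual configuration):

```ruby
# Hypothetical noise lists: extensions and name patterns that mark a file
# as cosmetic for scoring purposes.
NOISE_EXTENSIONS = %w[.png .jpg .gif .css .md].freeze
NOISE_NAMES      = [/\AREADME/i, /\ALICENSE/i, /test/].freeze

def noise?(path)
  NOISE_EXTENSIONS.include?(File.extname(path)) ||
    NOISE_NAMES.any? { |re| File.basename(path) =~ re }
end

files = %w[README.md app/models/user.rb spec/user_test.rb assets/logo.png lib/auth.rb]
kept = files.reject { |f| noise?(f) }
# kept => ["app/models/user.rb", "lib/auth.rb"]
```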
  21. Filtering developers ● For a given project, find the "cosmetic developers" ● Don't get me wrong, they are not useless, they just do things I don't care about
  22. Results ● Around 5-10% of commits have nothing to do with code... ● You can divide the size of most other commits by 2-3 if you ignore noise (files/comments/...): ○ new code with test cases ○ modifications in comments ○ ...
  23. Classification
  24. Data mining ● Take your samples (commits): ○ Extract a vector from each sample ○ Classify each sample ● From a training set, learn to classify the data ● Apply what you learned: ○ to the same training set after splitting it (cross-validation) ○ to new samples
  25. Data mining ● training set: [1,2,3,0,10,220] -> bugfix, [2,4,3,0,1,0] -> boring, [2,5,3,3,1,1] -> boring, [20,1,0,100,0,10] -> new bug ● testing: [23,0,1,90,0,15] -> ???
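The train/test step above can be sketched with a toy nearest-neighbour lookup over those vectors. The talk used Weka for the real classification; this 1-NN stands in for it:

```ruby
# Training vectors and labels, taken from the slide.
TRAINING = {
  [1, 2, 3, 0, 10, 220]  => :bugfix,
  [2, 4, 3, 0, 1, 0]     => :boring,
  [2, 5, 3, 3, 1, 1]     => :boring,
  [20, 1, 0, 100, 0, 10] => :new_bug
}

# Euclidean distance between two commit vectors.
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Label an unseen vector with its nearest training sample's label.
def classify(vector)
  TRAINING.min_by { |sample, _| distance(sample, vector) }.last
end

classify([23, 0, 1, 90, 0, 15])  # => :new_bug
```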
  26. Extracting a vector ● You can't really say a commit is close to another commit ● You need to generate a vector from each commit to compare them ● Once you have done that, everything else is just magic^W Maths
  27. Extracting a vector: getting data ● Number of lines changed: ○ insertion vs deletion ● Number of words changed (--word-diff): ○ insertion vs deletion ● Authors: ○ rating of authors based on the project's history ■ "fixing" score ■ "vulnerability creator" score ○ new developers ○ known security researchers
  28. Extracting a vector: getting data ● Number of "dangerous" functions: ○ insertion ○ deletion ● Number of "filtering" functions: ○ insertion ○ deletion ● commit date vs author date ● Keywords in the message and in the code
  29. Extracting a vector: getting data ● Files modified: ○ already implicated in a bug fix ○ already implicated in a vulnerability
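The dangerous/filtering counters above amount to walking the unified diff and counting added vs removed lines that contain each signature. A sketch, with illustrative signature lists:

```ruby
# Hypothetical signature lists; the real ones come from graudit-style databases.
DANGEROUS = %w[system exec eval echo]
FILTERING = %w[htmlentities intval escape]

# Count added/removed lines of a unified diff containing each signature.
def count_signatures(diff, signatures)
  counts = { insertion: 0, deletion: 0 }
  diff.each_line do |line|
    side = case line
           when /\A\+[^+]/ then :insertion  # skip the "+++" file header
           when /\A-[^-]/  then :deletion   # skip the "---" file header
           end
    next unless side
    counts[side] += signatures.count { |sig| line.include?(sig) }
  end
  counts
end

diff = <<~DIFF
  --- a/page.php
  +++ b/page.php
  -echo $_GET["name"];
  +echo htmlentities($_GET["name"]);
DIFF

count_signatures(diff, DANGEROUS)  # => {insertion: 1, deletion: 1}
count_signatures(diff, FILTERING)  # => {insertion: 1, deletion: 0}
```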
  30. Filtering vs Dangerous ● Good list of "dangerous" signatures from graudit ● Weighting is *really* important: ○ echo -> potential XSS -> 1 point ○ system -> potential command execution -> 10 points ● Some functions are in both: ○ crypto functions for example ○ crypto can be dangerous but can filter as well
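The weighting idea can be a simple points table summed over a commit's added lines. The echo/system weights follow the slide; the other entries are made up:

```ruby
# Points per dangerous signature: higher impact, higher weight.
WEIGHTS = { "echo" => 1, "system" => 10, "eval" => 10, "preg_replace" => 3 }

# Sum the weights of every signature found on the added lines.
def danger_score(added_lines)
  added_lines.sum do |line|
    WEIGHTS.sum { |sig, points| line.include?(sig) ? points : 0 }
  end
end

danger_score(['echo $name;', 'system($cmd);'])  # => 11
```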
  31. Filtering vs Dangerous ● Filtering: attr_protected, attr_accessible, htmlentities, File.basename, basename, preg_replace, intval, escape ● Dangerous: system, echo, eval, exec, assert, preg_replace, create_function, open3, popen, send
  32. Keywords ● Code: vulnerability, execution, SQL injection, Cross Site Scripting, XSS, disclosure, CVE, Directory traversal, Dangerous, Security, Command exec, CSRF, Risky ● Documentation: punctuation, CSS rules, CSS selector, Version number, Typo, description, Changelog
  33. Classification ● Fixed bugs: ○ learn from dangerous keywords ● New bugs: ○ git blame ○ read the source code and classify manually ● Potentially interesting new features: ○ read the source code ○ can be a new bug
  34. Results ● Vector computation: ○ between 15 and 120 minutes for 5000 commits ● Classification: ○ less than a minute ● Scoring: ○ 90% success rate on bug fixes (without using the message as part of the vector) ○ 50/50 between FP and FN on bug fixes ○ 200 commits down to 5-10 bugs per day
  35. My tool: SANZARU ● Japanese names for tools make you a Ninja ;) ● Ruby based (what else...) ● Data Mining done with Weka (thx Silvio)
  36. SANZARU: virtuous circle ● Made in a way that the more you learn on a project, the more effective it gets :) ● Score authors through learning ● Score files through learning ● Add functions used by the project
  37. SANZARU: "learning mode" ● Takes the last 5k commits and gives you the list of impacted files and authors with a weight ● Still working on finding the initial bug's author, but it doesn't really give you more information
  38. SANZARU: configuration file

      configure({
        :path      => "/home/snyff/code/rails",
        :type      => :git,
        :remote    => "origin/master",
        :origin    => "",
        :languages => [ :ruby ]
      })
      filter({
        :extensions => [ :html, :css, :jpg, :png, :md, :tpl ],
        :files      => ["LICENSE", "*test*"]
      })
      alert({
        :keywords => [keywords_default]
        ...
      })
  39. SANZARU: configuration file

      classify(
        :authors => {
          :default => 0, ""=>19, ""=>15, ""=>8, ""=>11, ""=>25, ....
        },
        :files => {
          :default => 0,
          "activemodel/lib/active_model/mass_assignment_security.rb" => 20,
          "railties/lib/rails/application.rb"                        => 17,
          "actionpack/lib/action_view/helpers/form_helper.rb"        => 17,
          "activerecord/lib/active_record/core.rb"                   => 17,
          ...
        }
      )
  40. SANZARU: "classification mode" ● Using Ruby to create all the vectors ● Using Weka to classify the data ● Then manual review of the results: ○ New features to find security bugs ○ FP for possible silent patching
  41. SANZARU: "daily mode" ● Cron job (every day): ○ update all repositories (hasn't been blacklisted by GitHub... yet), ruby-git is *shit* ○ find alerts in new commits ○ classify new commits ○ give me a nice report with what to read
  42. SANZARU: example of output
  43. Example found this week (not exploitable... yet): esc_js escapes and "... this doesn't
  44. Example found this week:
  45. Example found this morning:
  46. General observations ● Most fixes are: ○ small code insertions (less than 10 lines) ○ basic line substitutions ○ easy to detect ● Most new bugs are: ○ details... ○ really hard to detect statistically ○ general approach: read all potentially interesting commits ○ working on important projects makes the creation of bugs far less likely ○ it's not going to rain 0dayz...
  47. Possible improvements ● Integrating syntactic analysis: ○ regular expressions are just not enough ○ False alerts are time consuming... ● Retrieve information from external sources: ○ bug reports ○ CVE ● Support for more languages/platforms: ○ Objective C libraries and applications? ○ Linux kernel? ○ ...
  48. Conclusion ● Easy to detect: ○ (Silent) Security Fixes ○ New features with "interesting" functions ● Not so easy to detect: ○ New security bugs ● Still worth the time: ○ if you want bugs ○ if you are doing code review, to have examples of vulnerability patterns to learn from or share ○ most frustrating thing you can do?
  49. Questions? @snyff ● Have a great Ruxcon ● Play the CTF and Lock Picking ● Remember to check out: ○ ○ @PentesterLab ● Thx to everyone who helped me put this talk together