FooCodeChuServices for software analysis, malwaredetection, and vulnerability researchSilvio Cesare <silvio.cesare@gmail.c...
Who am I and why this talk?• Ph.D. Student at Deakin University• Book Author• This talk covers some of my publically acces...
Introduction• Research on software analysis, similarity, and  classification ▫   Malware detection and attribution ▫   Inc...
Outline• Simseer• Clonewise• Bugwise• Future Work and Conclusion
Software similarity and visualisation
Motivation• Many applications of software similarity  ▫ Malware detection  ▫ Plagiarism detection  ▫ Software theft detect...
Program Representation    lea     0x4(%esp),%ecx    and     $0xfffffff0,%esp                    Proc_0    pushl   -0x4(%ec...
Simseer Program Fingerprint• Set of control flow graphs• Many procedures
Decompilation of a Control Flow Graph                                   proc(){                     L_0           L_0:    ...
Q-Grams• Input is decompiled strings• Extract all possible fixed size substrings (q-  grams)• Train 500 dominant q-grams  ...
Program Similarity• 500 q-grams make a „feature vector‟• Similarity using vector distance
Software similarity search                           Query Benign                                           r             ...
Future Work• Give access to more classes of program  „fingerprints‟ ▫ Call graphs ▫ Opcodes ▫ Different similarity measures
Simseer summary• Simseer is effective• Efficient• Web service is free for public use
Detecting package clones and inferring security problems
Motivation• Developers may “embed” or “clone” software  from 3rd party sources ▫ Maintaining an internal copy of a library...
Feature Extraction – Shared packageclone detection                1.    N_Filenames_A                2.    N_Filenames_Sou...
Classification• Consider feature vectors as n-dimensional  points in space.• Linear classifiers• Non-linear classifiers• D...
Feature Extraction – Embedded clonedetection               1. N_Filenames_A               2. N_Filenames_Source_A         ...
Detecting copyright violations1. Identify embedded package clones.2. Extract license information of each package.3. For ea...
Automated Vulnerability Inference1. Take CVE, match CPE name to Debian package.2. Parse CVE summary and extract vuln filen...
Package clone detection use-case
Finding Vulnerabilities
Shared package clone evaluation Classifier      TP/FN     FP/TN       TP Rate   FP Rate Naïve Bayes     439/322   484/5629...
Embedded clone detection evaluation Classifier      TP/FN     FP/TN       TP Rate   FP Rate Naïve Bayes     718/43    6341...
Automatic detection of suspiciousclones    PACKAGE           EMBEDDED PACKAGE    freevo            feedparser    hedgewars...
Future Work• Binary-level clone detection• Integrate into Linux distributions• Linux security teams usage
Clonewise summary• Practical clone detection in Linux• Improves manual only tracking• Has found bugs• Debian Linux want to...
Detecting bugs in binaries using decompilation and data flowanalysis
Motivation• Detecting bugs in binary is useful ▫   Black-box penetration testing ▫   External audits and compliance ▫   Qu...
Wire – A formal language for binaryanalysis• x86 is complex and big• Wire is a low level RISC assembly style language• Tra...
Stack Pointer Inference• Proposed in HexRays decompiler -  http://www.hexblog.com/?p=42• Estimate Stack Pointer (SP) in an...
Decompilation - Local Variable Recovery  • Based on stack pointer inference  • Access to memory offset to the stack  • Rep...
Data Flow Analysis - ReachingDefinitions• A reaching definition is a definition of a variable  that reaches a program poin...
More data flow problems• Upward Exposed Uses  ▫ All uses of a definition• Live Variables  ▫ A variable is live if it will ...
getenv() bugs• Detect unsafe applications of getenv()• Example: strcpy(buf,getenv(“HOME”))• For each getenv() ▫ If return ...
Use-after-free Detection• For each free(ptr) ▫ If ptr live                       void f(int x) ▫ Then warn           {    ...
Double Free Detection• For each free(ptr) ▫ If an upward exposed use of ptr‟s definition is   free(ptr)                   ...
getenv() bugs•   Scanned entire Debian 7 unstable repository•   ~123,000 ELF binaries                            4digits  ...
getenv() bugs over time –sorted by binary size• Linear or power growth?
getenv() bug statistics• Probability (P) of a binary being vulnerable: 0.00067• P. of a package being vulnerable: 0.00255 ...
Double free in SGID games “xonix”  memset(score_rec[i].login, 0, 11);  strncpy(score_rec[i].login, pw->pw_name, 10);  mems...
Future Work• Core ▫   Summary-based interprocedural analysis ▫   Context sensitive interprocedural analysis ▫   Pointer an...
Bugwise summary• Practical tool to find simple bugs• Based on strong theory• Extensible• Much work to do in the future• We...
Future Work• Make more of my research public• Provide better backend infrastructure• Get people to use the services!
Conclusion• All of the tools in this talk are for public use• http://www.FooCodeChu.com  ▫ Wiki on software similarity and...
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research
Upcoming SlideShare
Loading in …5
×

FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research

1,279 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,279
On SlideShare
0
From Embeds
0
Number of Embeds
39
Actions
Shares
0
Downloads
19
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

FooCodeChu - Services for Software Analysis, Malware Detection, and Vulnerability Research

  1. 1. FooCodeChuServices for software analysis, malwaredetection, and vulnerability researchSilvio Cesare <silvio.cesare@gmail.com>
  2. 2. Who am I and why this talk?• Ph.D. Student at Deakin University• Book Author• This talk covers some of my publically accessible Ph.D. research.
  3. 3. Introduction• Research on software analysis, similarity, and classification ▫ Malware detection and attribution ▫ Incident response ▫ Plagiarism detection ▫ Software theft detection ▫ Vulnerability research• Three academic research tools free to use on my website.
  4. 4. Outline• Simseer• Clonewise• Bugwise• Future Work and Conclusion
  5. 5. Software similarity and visualisation
  6. 6. Motivation• Many applications of software similarity ▫ Malware detection ▫ Plagiarism detection ▫ Software theft detection• Traditional string signatures are ineffective• Modern fingerprints effective but in many case inefficient
  7. 7. Program Representation lea 0x4(%esp),%ecx and $0xfffffff0,%esp Proc_0 pushl -0x4(%ecx) push %ebp mov %esp,%ebp push %ecx sub $0x24,%esp call 4011b0 <___main> movl $0x0,-0x8(%ebp) jmp 40115f <_main+0x2f> Proc_1 Proc_3 movl $0x4020a0,(%esp) call 4011b8 <_puts> addl $0x1,-0x8(%ebp) cmpl $0x9,-0x8(%ebp) Proc_4 jle 40114f <_main+0x1f> add $0x24,%esp pop %ecx pop %ebp Proc_2 lea -0x4(%ecx),%esp ret
  8. 8. Simseer Program Fingerprint• Set of control flow graphs• Many procedures
  9. 9. Decompilation of a Control Flow Graph proc(){ L_0 L_0: W|IEH}R while (v1 || v2) { L_3 L_1: if (v3) { true L_2: L_6 } else { true L_4: } L_1 L_7 L_5: true } true L_7: return; L_2 L_4 } true L_5
  10. 10. Q-Grams• Input is decompiled strings• Extract all possible fixed size substrings (q- grams)• Train 500 dominant q-grams W|IE |IEH W|IEH}R IEH} EH}R
  11. 11. Program Similarity• 500 q-grams make a „feature vector‟• Similarity using vector distance
  12. 12. Software similarity search Query Benign r q distance(p,q) p Query Malicious Query Malware
  13. 13. Future Work• Give access to more classes of program „fingerprints‟ ▫ Call graphs ▫ Opcodes ▫ Different similarity measures
  14. 14. Simseer summary• Simseer is effective• Efficient• Web service is free for public use
  15. 15. Detecting package clones and inferring security problems
  16. 16. Motivation• Developers may “embed” or “clone” software from 3rd party sources ▫ Maintaining an internal copy of a library ▫ Forking a library• Clonewise detects if two packages share code• And if one package is entirely embedded in another. Firefox Vulnerabilities libpng Vulnerabilities
  17. 17. Feature Extraction – Shared packageclone detection 1. N_Filenames_A 2. N_Filenames_Source_A 3. N_Filenames_B 4. N_Filenames_Source_B 5. N_Common_Filenames 6. N_Common_Similar_Filenames 7. N_Common_FilenameHashes 8. N_Common_FilenameHash80 9. N_Common_ExactFilenameHash 10. N_Score_of_Common_Filename 11. N_Score_of_Common_Similar_Filename 12. N_Score_of_Common_FilenameHash 13. N_Score_of_Common_FilenameHash80 14. N_Score_of_Common_ExactFilenameHash80 15. N_Data_Common_Filenames 16. N_Data_Common_Similar_Filenames 17. N_Data_Common_FilenameHashes 18. N_Data_Common_FilenameHash80 19. N_Data_Common_ExactFilenameHash 20. N_Data_Score_of_Common_Filename 21. N_Data_Score_of_Common_Similar_Filename 22. N_Data_Score_of_Common_FilenameHash 23. N_Data_Score_of_Common_FilenameHash80 24. N_Data_Score_of_Common_ExactFilenameHash80 25. N_Common_ExactHash 26. N_Common_DataExactHash
  18. 18. Classification• Consider feature vectors as n-dimensional points in space.• Linear classifiers• Non-linear classifiers• Decision trees Class A Class B
  19. 19. Feature Extraction – Embedded clonedetection 1. N_Filenames_A 2. N_Filenames_Source_A 3. N_Filenames_B 4. N_Filenames_Source_B 5. Percent_Match_In_A 6. Percent_Data_Match_In_A 7. Percent_Match_In_B 8. Percent_Data_Match_In_B 9. Percent_Score_In_A 10.Percent_Data_Score_In_A 11.Percent_Score_In_B 12.Percent_Data_Score_In_B 13.A_Has_Lib_In_Name 14.B_Has_Lib_In_Name 15.A_To_B_Ratio 16.A_To_B_Data_Ratio 17.N_Dependents_A 18.N_Dependents_B
  20. 20. Detecting copyright violations1. Identify embedded package clones.2. Extract license information of each package.3. For each GPL licensed embedded package clone: ▫ Verify that the package it is embedded in is not licensing it under a permissive license.
  21. 21. Automated Vulnerability Inference1. Take CVE, match CPE name to Debian package.2. Parse CVE summary and extract vuln filename.3. Find clones of package with similar filename.4. Trim dynamically linked clones.5. Is vuln affected clone already being tracked?
  22. 22. Package clone detection use-case
  23. 23. Finding Vulnerabilities
  24. 24. Shared package clone evaluation Classifier TP/FN FP/TN TP Rate FP Rate Naïve Bayes 439/322 484/56296 57.69% 0.85% Multilayer Perceptron 204/557 48/56732 26.81% 0.08% C4.5 523/238 86/56694 68.73% 0.15% Random Forest 533/228 60/56720 70.04% 0.11% Random Forest (0.8) 446/315 15/56765 58.61% 0.03%
  25. 25. Embedded clone detection evaluation Classifier TP/FN FP/TN TP Rate FP Rate Naïve Bayes 718/43 6341/2808 94.35% 69.31% Multilayer Perceptron 328/433 108/9041 43.10% 1.18% C4.5 572/189 69/9080 75.16% 0.75% Random Forest 554/207 68/9081 72.80% 0.74% Asymmetric Bagging 699/62 615/8534 91.86% 6.72%
  26. 26. Automatic detection of suspiciousclones PACKAGE EMBEDDED PACKAGE freevo feedparser hedgewars freetype ia32-libs * libtk-img tiff likewise-open curl luatex poppler planet-venus feedparser syslinux libpng vnc4 freetype vtk tiff
  27. 27. Future Work• Binary-level clone detection• Integrate into Linux distributions• Linux security teams usage
  28. 28. Clonewise summary• Practical clone detection in Linux• Improves manual only tracking• Has found bugs• Debian Linux want to integrate it into infrastructure• Open source project• Web service to perform clone detection
  29. 29. Detecting bugs in binaries using decompilation and data flowanalysis
  30. 30. Motivation• Detecting bugs in binary is useful ▫ Black-box penetration testing ▫ External audits and compliance ▫ Quality assurance of 3rd party software ▫ Verification of compilation and linkage
  31. 31. Wire – A formal language for binaryanalysis• x86 is complex and big• Wire is a low level RISC assembly style language• Translated from x86• Formally defined operational semantics The LOAD instruction implements a memory read.
  32. 32. Stack Pointer Inference• Proposed in HexRays decompiler - http://www.hexblog.com/?p=42• Estimate Stack Pointer (SP) in and out of basic block ▫ By tracking and estimating SP modifications using linear inequalities• Solve.Picture from HexRays blog .
  33. 33. Decompilation - Local Variable Recovery • Based on stack pointer inference • Access to memory offset to the stack • Replace with native Wire registerImark ($0x80483f5, , )AddImm32 (%esp(4), $0x1c, %temp_memreg(12c))LoadMem32 (%temp_memreg(12c), , %temp_op1d(66)) Imark ($0x80483f5, , )Imark ($0x80483f9, , ) Imark ($0x80483f9, , )StoreMem32(%temp_op1d(66), , %esp(4))  Imark ($0x80483fc, , )Imark ($0x80483fc, , ) Free (%local_28(186bc), , )SubImm32 (%esp(4), $0x4, %esp(4))LoadImm32 ($0x80483fc, , %temp_op1d(66))StoreMem32(%temp_op1d(66), , %esp(4))Lcall (, , $0x80482f0)
  34. 34. Data Flow Analysis - ReachingDefinitions• A reaching definition is a definition of a variable that reaches a program point without being redefined. X=1 Y=3 X>2 X <=2 X=2 Print(X) Print(X) Y=3, X=1, and X=2 are Print(X) reaching definitions
  35. 35. More data flow problems• Upward Exposed Uses ▫ All uses of a definition• Live Variables ▫ A variable is live if it will be subsequently read without being redefined.• Reaching Copies ▫ The reach of a copy statement• etc
  36. 36. getenv() bugs• Detect unsafe applications of getenv()• Example: strcpy(buf,getenv(“HOME”))• For each getenv() ▫ If return value is live ▫ And it‟s the reaching definition to the 2nd argument to strcpy() ▫ Then warn• P.S. 2001 wants its bugs back.
  37. 37. Use-after-free Detection• For each free(ptr) ▫ If ptr live void f(int x) ▫ Then warn { int *p = malloc(10); dowork(p); free(p); if (x) p[0] = 1; }
  38. 38. Double Free Detection• For each free(ptr) ▫ If an upward exposed use of ptr‟s definition is free(ptr) void f(int x) ▫ Then warn { int *p = malloc(10); dowork(p); free(p); if (x) free(p);• 2001 calls again }
  39. 39. getenv() bugs• Scanned entire Debian 7 unstable repository• ~123,000 ELF binaries 4digits ptop acedb-other-belvu recordmydesktop acedb-other-dotter rlplot bvi sapphire• 85 bug reports comgt csmash sc scm elvis-tiny sgrep• 47 packages fvwm garmin-ant-downloader slurm-llnl-slurmdbd statserial gcin stopmotion gexec supertransball2 gmorgan theorur gopher twpsk gsoko udo gstm vnc4server hime wily le-dico-de-rene-cougnenc wmpinboard libreoffice-dev wmppp.app libxgks-dev xboing lie xemacs21-bin lpe xjdic mp3rename xmotd mpich-mpd-bin open-cobol procmail
  40. 40. getenv() bugs over time –sorted by binary size• Linear or power growth?
  41. 41. getenv() bug statistics• Probability (P) of a binary being vulnerable: 0.00067• P. of a package being vulnerable: 0.00255 P( A B) P( A | B) P(B) Conditional probability of A given that B has occurred:• P. of a package having a 2nd vulnerability given that one binary in the package is vulnerable: 0.52380
  42. 42. Double free in SGID games “xonix” memset(score_rec[i].login, 0, 11); strncpy(score_rec[i].login, pw->pw_name, 10); memset(score_rec[i].full, 0, 65); strncpy(score_rec[i].full, fullname, 64); score_rec[i].tstamp = time(NULL); free(fullname); if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) { fprintf(stderr, "xonix: cannot reopen high score filen"); free(fullname); gameover_pending = 0; return; }
  43. 43. Future Work• Core ▫ Summary-based interprocedural analysis ▫ Context sensitive interprocedural analysis ▫ Pointer analysis ▫ Improved decompilation• More bug classes
  44. 44. Bugwise summary• Practical tool to find simple bugs• Based on strong theory• Extensible• Much work to do in the future• Web service free to use
  45. 45. Future Work• Make more of my research public• Provide better backend infrastructure• Get people to use the services!
  46. 46. Conclusion• All of the tools in this talk are for public use• http://www.FooCodeChu.com ▫ Wiki on software similarity and classification ▫ Preprint of my book available• Buy my book from Springer

×