
DSSM_slides


Slides for the research work: dataflow sensitive specification mining


  1. Mining Dataflow Sensitive Specifications. Zhiqiang Zuo and Siau-Cheng Khoo, National University of Singapore.
  2. Outline
     • Introduction
     • Instrumentation
     • Dataflow Tracking Analysis
     • Constrained Iterative Pattern Mining
     • Empirical Evaluation
     • Related Work
     • Conclusion
  3. Introduction
     • Fact: specifications play a crucial role in many development tasks, e.g., program comprehension, verification, and debugging.
     • Problem: precise and complete specifications are commonly lacking.
     • Solution: specification mining, which automatically infers models or properties as specifications.
  4. Statistical Inference: significant program properties occur frequently.
     • Obstacle: mining yields a huge number of meaningless specifications, which are laborious to separate out and adversely affect efficiency.
     • Reason: statistical significance ≠ semantic significance.
  5. Example: code change between revision 922309 and 922299 in ZipUtil

       + static boolean supportsDataDescriptorFor(ZipArchiveEntry entry) {
       +     return !entry.getGeneralPurposeBit().usesDataDescriptor()
       +         || entry.getMethod() == ZipArchiveEntry.DEFLATED;
       + }
         static boolean supportsEncryptionOf(ZipArchiveEntry entry) {
       -     return !entry.isEncrypted();
       +     return !entry.getGeneralPurposeBit().usesEncryption();
         }

     Patterns shown on the slide:
     • ZipUtil: supportsDataDescriptorFor(. . . ); ZipArchiveEntry: getGeneralPurposeBit(); R: ZipArchiveEntry: getGeneralPurposeBit(); GeneralPurposeBit: usesDataDescriptor();
     • ZipUtil: supportsDataDescriptorFor(. . . ); ZipArchiveEntry: getMethod(); R: ZipArchiveEntry: getMethod();
     • ZipUtil: checkRequestedFeatures(. . . ); ZipUtil: supportsEncryptionOf(. . . ); ZipArchiveEntry: getGeneralPurposeBit(); R: ZipArchiveEntry: getGeneralPurposeBit(); GeneralPurposeBit: usesEncryption();
     • ZipUtil: checkRequestedFeatures(. . . ); ZipUtil: supportsEncryptionOf(. . . ); ZipArchiveEntry: isEncrypted(); R: ZipArchiveEntry: isEncrypted();

     In total: 63 discriminative patterns (48 additional and 15 deleted).
  6. Semantic-directed Spec Mining
     • Semantically significant = semantically relevant + statistically significant
     • Pipeline: Program → Instrumenter → Instrumented program → Run → Traces → Semantic analyzer → Semantically relevant sequences → Frequent pattern miner → Semantically significant specifications
  7. Dataflow Sensitive Spec Mining
     • Semantic relation
       - Dataflow relation (use-def pair)
       - Dynamic, inter-procedural dataflow tracking analysis
     • Formalism
       - Iterative pattern†
       - Constrained iterative pattern mining
     † Lo et al., Efficient mining of iterative patterns for software specification discovery, KDD'07
  8. Pipeline: Program → Symbolic instrumentor → Instrumented program → Run → Traces → Dataflow tracker → Dataflow related sequences → Constrained iterative pattern miner → Specifications
  10. Instrumentation
      • Static instrumentation
      • Trace: a sequence of symbolic statements
      • Statement: 3-address Jimple format in the Soot framework†
        - identityStmt, assignStmt, invokeStmt, returnStmt
      † http://www.sable.mcgill.ca/soot/
  11. Example program and its trace

      Program:

        public class Demo {
            // Squares both arguments and returns half of the larger square.
            public int invoke(int a, int b) {
                int sa = square(a);
                int sb = square(b);
                return max(sa, sb) / 2;
            }

            private int square(int r) {
                return r * r;
            }

            static int max(int i, int j) {
                if (i > j) { return i; } else { return j; }
            }

            public static void main(String[] args) {
                new Demo().invoke(2, -1);
            }
        }

      Trace:

         1: $r1 = new Demo
         2: invoke $r1.<Demo:int invoke(int,int)>(2,-1)
         3: r0 := @this: Demo
         4: i0 := @parameter0: int
         5: i1 := @parameter1: int
         6: i2 = invoke r0.<Demo:int square(int)>(i0)
         7: r0 := @this: Demo
         8: i0 := @parameter0: int
         9: $i1 = i0 * i0
        10: return $i1
        11: i3 = invoke r0.<Demo:int square(int)>(i1)
        12: r0 := @this: Demo
        13: i0 := @parameter0: int
        14: $i1 = i0 * i0
        15: return $i1
        16: $i4 = invoke <Demo:int max(int,int)>(i2,i3)
        17: i0 := @parameter0: int
        18: i1 := @parameter1: int
        19: return i0
        20: $i5 = $i4 / 2
        21: return $i5
  13. Dataflow Tracking Analysis
      • Input: execution traces (sequences of executed statements), e.g., the trace on slide 11
      • Output: dataflow related sequences (sequences of calls and returns), e.g., <2, 6, 10, 16, 19, 21>
  14. Definition (statement numbers refer to the trace on slide 11)
      • Data dependency: e.g., 9 is data dependent on 8
      • Dataflow path: a sequence of statements, e.g., <4, 8, 9, 17, 20>
      • Dataflow association: e.g., 6 and 10 are both associated with <4, 8, 9, 17, 20> (6->8, 10->9)
      • (Maximum) dataflow related sequence: a sequence of events (calls and returns), e.g., <2, 6, 10, 16, 19, 21>
      [Figure: graph over trace statements 1 to 21]
  15. Dataflow path <4, 8, 9, 17, 20> and dataflow related sequence <2, 6, 10, 16, 19, 21>, shown against the trace on slide 11.
  16. Approach
      • Track each dataflow path by analyzing use-def pairs statement by statement, in chronological order.
      • Maintain one event list per dataflow path.
      • Append each event that is dataflow associated with the currently tracked dataflow path to the end of the corresponding event list.
      • Output the event list on reaching the end of the dataflow path.
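The four steps above can be illustrated with a simplified sketch. It is not the authors' tracker: the class `DataflowTrackerSketch`, its `Stmt` and `Path` types, and the frame-qualified variable names (`inv.i0`, `sq1.p0`, and so on) are all assumptions made for illustration. Each statement of the slide 11 trace is hand-encoded as (from, to) value flows, and only the integer flows are encoded (receiver-object statements 1, 3, 7, and 12 are left out). Paths are returned together at the end of the scan rather than emitted one by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataflowTrackerSketch {

    /** One executed statement (hypothetical encoding, not Soot's API). */
    public static final class Stmt {
        final int id;          // statement number in the trace
        final boolean event;   // true for calls and returns
        final String[] flows;  // flattened pairs: from0, to0, from1, to1, ...; null = none

        Stmt(int id, boolean event, String... flows) {
            this.id = id;
            this.event = event;
            this.flows = flows;
        }
    }

    /** One tracked dataflow path together with its own event list. */
    public static final class Path {
        public final List<Integer> stmts = new ArrayList<>();
        public final List<Integer> events = new ArrayList<>();
    }

    /** Scans the trace chronologically and follows each use-def pair. */
    public static List<Path> track(List<Stmt> trace) {
        List<Path> paths = new ArrayList<>();
        Map<String, Path> carrier = new HashMap<>(); // variable -> path carrying its value
        for (Stmt s : trace) {
            for (int i = 0; i < s.flows.length; i += 2) {
                String from = s.flows[i];
                String to = s.flows[i + 1];
                Path p = (from == null) ? null : carrier.get(from);
                if (p == null) {           // value has no tracked source: start a new path
                    p = new Path();
                    paths.add(p);
                }
                // Events go to the path's event list, other statements to the path itself.
                List<Integer> target = s.event ? p.events : p.stmts;
                if (target.isEmpty() || target.get(target.size() - 1) != s.id) {
                    target.add(s.id);
                }
                if (to != null) {
                    carrier.put(to, p);    // the defined variable now carries this path
                }
            }
        }
        return paths;
    }

    /** Integer flows of the slide 11 trace, encoded by hand (an assumption). */
    public static List<Stmt> demoTrace() {
        return Arrays.asList(
            new Stmt(2, true, null, "inv.p0", null, "inv.p1"),
            new Stmt(4, false, "inv.p0", "inv.i0"),
            new Stmt(5, false, "inv.p1", "inv.i1"),
            new Stmt(6, true, "inv.i0", "sq1.p0"),
            new Stmt(8, false, "sq1.p0", "sq1.i0"),
            new Stmt(9, false, "sq1.i0", "sq1.$i1"),
            new Stmt(10, true, "sq1.$i1", "inv.i2"),
            new Stmt(11, true, "inv.i1", "sq2.p0"),
            new Stmt(13, false, "sq2.p0", "sq2.i0"),
            new Stmt(14, false, "sq2.i0", "sq2.$i1"),
            new Stmt(15, true, "sq2.$i1", "inv.i3"),
            new Stmt(16, true, "inv.i2", "max.p0", "inv.i3", "max.p1"),
            new Stmt(17, false, "max.p0", "max.i0"),
            new Stmt(18, false, "max.p1", "max.i1"),
            new Stmt(19, true, "max.i0", "inv.$i4"),
            new Stmt(20, false, "inv.$i4", "inv.$i5"),
            new Stmt(21, true, "inv.$i5", null));
    }

    public static void main(String[] args) {
        for (Path p : track(demoTrace())) {
            System.out.println("path " + p.stmts + " -> events " + p.events);
        }
    }
}
```

Under this encoding, the first path comes out as <4, 8, 9, 17, 20> with event list <2, 6, 10, 16, 19, 21>, matching the example on the slides. Events 2 and 16 also land in the second path's event list, which is the duplication effect noted later in the deck.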
  18. Constrained Iterative Pattern Mining
      • Input: a set of dataflow related sequences
      • Output: frequent dataflow related iterative patterns
      • Problem setting
        - An iterative pattern is identified by a set of instances.
        - One sequence can contain multiple instances of a pattern.
  19. Iterative Pattern Instance (QRE)
      Given a pattern p_n = <e1 e2 … en>, a substring <f1 f2 … fm> of a temporal sequence t = <t1 t2 … tend> in a sequence database is an instance of p_n iff it can be expressed by the following QRE expression:
        e1;[-e1,…,en]*;e2;…;[-e1,…,en]*;en
      Example sequence: a c d c d e f b
      Pattern: <a, c>
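A minimal sketch of the QRE instance check, assuming events are plain strings and positions are reported 1-based. The class and method names (`QreInstanceSketch`, `findInstances`) are hypothetical and not taken from the slides or from the KDD'07 paper. Between two consecutive pattern events, any event outside the pattern's alphabet may be skipped, while a pattern event that appears out of turn ends the attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QreInstanceSketch {

    /**
     * Returns the 1-based positions of every instance of the pattern in the
     * sequence under e1;[-e1,...,en]*;e2;...;[-e1,...,en]*;en.
     */
    public static List<List<Integer>> findInstances(List<String> pattern, List<String> sequence) {
        List<List<Integer>> instances = new ArrayList<>();
        if (pattern.isEmpty()) {
            return instances;
        }
        Set<String> alphabet = new HashSet<>(pattern);
        for (int start = 0; start < sequence.size(); start++) {
            if (!sequence.get(start).equals(pattern.get(0))) {
                continue; // an instance must begin with e1
            }
            List<Integer> positions = new ArrayList<>();
            positions.add(start + 1);
            int next = 1; // index of the pattern event expected next
            for (int k = start + 1; k < sequence.size() && next < pattern.size(); k++) {
                String event = sequence.get(k);
                if (event.equals(pattern.get(next))) {
                    positions.add(k + 1);
                    next++;
                } else if (alphabet.contains(event)) {
                    break; // a pattern event out of turn violates [-e1,...,en]*
                }
                // any other event is skipped
            }
            if (next == pattern.size()) {
                instances.add(positions);
            }
        }
        return instances;
    }

    public static void main(String[] args) {
        List<String> sequence = Arrays.asList("a", "c", "d", "c", "d", "e", "f", "b");
        System.out.println(findInstances(Arrays.asList("a", "c"), sequence));
        System.out.println(findInstances(Arrays.asList("c", "d"), sequence));
    }
}
```

For the slide's sequence, the pattern <a, c> has a single instance at positions 1 and 2: the later c at position 4 does not pair with the a at position 1, because the c at position 2 lies between them.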
  20. Sequence and transactions

      Index  1  2  3  4  5  6  7  8
      Event  a  c  d  c  d  e  f  b

      Tid  Transaction
      0    1 a, 2 c
      1    1 a, 4 c
      2    1 a, 2 c, 3 d, 6 e, 7 f, 8 b
      3    1 a, 4 c, 5 d, 6 e

      • Duplication: one event may be associated with multiple dataflow paths.
      • Correspondence: some instances may not be reasonable instances.
  21. Constrained Iterative Pattern Instance
      Pattern: <a, ce>
      Given a trace T and its event list L(T), an ordering number subsequence sub_s = <o1 o2 … on> of a sequence s in D(T) is a constrained iterative pattern instance of p_n = <e1 e2 … en> iff the following two conditions hold:
      [conditions not recoverable from the transcript]
      (Index and transaction tables as on slide 20.)
  22. Apriori Property
      If a pattern p_k = <e1, e2, …, ek-1, ek> is frequent, then the following are also frequent:
      • its prefix pattern <e1, e2, …, ek-1>;
      • its suffix pattern <e2, …, ek-1, ek>;
      • every infix pattern in_pk-1 = <e1, …, ei-1, ei+1, …, ek>, where 2 ≤ i ≤ k-1 and ei is not in in_pk-1.
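As a small illustration of the property, the sketch below enumerates the sub-patterns that must be frequent for a given pattern to be frequent, which is the check a level-wise miner could apply before counting a candidate. The class and method names (`AprioriSubpatternSketch`, `requiredSubpatterns`) are hypothetical, and events are assumed to be plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AprioriSubpatternSketch {

    /**
     * Lists the prefix pattern, the suffix pattern, and every infix pattern
     * obtained by dropping an interior event e_i that does not occur in the
     * remaining pattern.
     */
    public static List<List<String>> requiredSubpatterns(List<String> pattern) {
        List<List<String>> result = new ArrayList<>();
        int k = pattern.size();
        if (k < 2) {
            return result; // a single event has no shorter sub-pattern
        }
        result.add(new ArrayList<>(pattern.subList(0, k - 1))); // prefix <e1..ek-1>
        result.add(new ArrayList<>(pattern.subList(1, k)));     // suffix <e2..ek>
        // 0-based indices 1..k-2 correspond to 2 <= i <= k-1 on the slide.
        for (int i = 1; i <= k - 2; i++) {
            List<String> infix = new ArrayList<>(pattern);
            String removed = infix.remove(i);
            if (!infix.contains(removed)) {
                result.add(infix);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(requiredSubpatterns(Arrays.asList("a", "b", "c", "d")));
        System.out.println(requiredSubpatterns(Arrays.asList("a", "b", "a", "c")));
    }
}
```

For <a, b, a, c>, dropping the interior a is skipped because a still occurs in the remaining pattern, so only the prefix, the suffix, and <a, a, c> are listed.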
  23. Algorithm: CIPM(D(T), min_sup, min_den)
      Input: sequence database D(T), support threshold min_sup, density threshold min_den
      Output: a set Fclosed containing all frequent closed patterns

        1. F1 <- {p1 | sup(p1) ≥ min_sup};
        2. for (k <- 2; Fk-1 ≠ empty set; k++) do
        3.     Ck <- apriori_gen(Fk-1);
        4.     Fk <- apriori_count(Ck, min_sup);
        5.     Fk <- prune_density(Fk, min_den);
        6.     Fclosed <- process_closed(Fk, Fclosed);
        7. end
        8. return Fclosed;
  25. Evaluation subjects

      Subject   Version  #LoC     #Class  #Method  Description
      JDepend   2.9.1    2,723    18      224      Java dependency analyzer
      Libsvm    3.1      3,188    21      98       SVM implementation
      Compress  1.3      9,629    59      502      Commons Compress library
      PMD       4.2.5    66,881   720     4,991    Java source code analyzer
      Fop       0.95     185,186  1,313   9,840    XSL-FO to PDF transformer
  26. Runtime Performance of Dataflow Tracker

                | Trace Generation                         | Dataflow Analysis
      Subject   | #Test  #Trace  #Stmt(k)  #Event(k)  Time(s) | #Seq(k)  AL    Time(s)
      JDepend   | 5      5       494       93         5       | 73       6.9   5
      Libsvm    | 5      5       854       36         7       | 8        5.7   5
      Compress  | 5      5       949       254        10      | 156      4.7   10
      PMD       | 4      8       2119      498        17      | 299      26.8  20
      Fop       | 5      5       3480      621        42      | 535      10.6  53
  27. [Figure: Execution time against the number of statements. x-axis: Statements (k), 0 to 4000; y-axis: Time (s), 0 to 50.]
  28. Performance Comparison (DSSM vs. IPM)

      Subject   Support  DSSM #Pattern  DSSM Time(s)
      JDepend   15       221            17
                30       181            15
                50       54             9
      Libsvm    5        130            16
                10       115            15
                30       48             7
      Compress  15       79             22
                50       44             19
                100      32             18
      PMD       150      205            97
                250      100            45
                450      32             35
      Fop       400      211            171
                1000     70             81
                1500     21             60

      IPM #Pattern, IPM Time(s), and Ratio (IPM/DSSM) cells for #Pattern and Time, in the order they appear in the transcript:
      JDepend:   * - >10
      Libsvm:    * * 175 - - 99 >10 3.6 >10
      Compress:  * - >10
      PMD:       * * 149 - >10 4.7 >10
      Fop:       * * 636 - - 1342 >10
  29. Case Study 1: the code change between revision 922309 and 922299 in ZipUtil and the four patterns listed on slide 5.
      • IPM: 48 additional and 15 deleted patterns
      • DSSM: 3 additional and 1 deleted pattern
  30. Case Study 2: additional pattern in revision 911467

        zip.ZipArchiveInputStream: void fill();
        ArchiveInputStream: void count(int);
        ArchiveInputStream: void count(long);
        R: ArchiveInputStream: void count(long);
        R: ArchiveInputStream: void count(int);
        R: zip.ZipArchiveInputStream: void fill();

      • IPM: 151 patterns, 11 of which involve method fill
      • DSSM: 7 patterns, none of which involves fill
  32. Related Work
      • Semantic-based specification mining
        - Intra-procedural data dependency: POPL'02, ICSE'09
        - Object sharing relation: ASE'09, FSE'07, ICSE'11
        - Data predicates: Daikon (TSE'01)
      • Program slicing
        - Dynamic data slicing: ICSE'03
  34. Conclusion
      • Semantic-directed specification mining: automatically infers semantically significant specifications
      • Dataflow tracking analysis: dynamic, fine-grained, inter-procedural, scoped
      • Constrained iterative pattern mining: Apriori-like, BFS
      • Experiments
        - Effective in filtering semantically irrelevant patterns
        - Efficient in generating semantically significant patterns
        - Practical in program understanding and bug detection
  35. Thank you. Questions, comments, advice?
