- 1. Data Mining: Concepts and Techniques — Chapter 11 — — Software Bug Mining — <ul><li>Jiawei Han and Micheline Kamber </li></ul><ul><li>Department of Computer Science </li></ul><ul><li>University of Illinois at Urbana-Champaign </li></ul><ul><li>www.cs.uiuc.edu/~hanj </li></ul><ul><li>©2006 Jiawei Han and Micheline Kamber. All rights reserved. </li></ul><ul><li>Acknowledgement: Chao Liu </li></ul>
- 3. Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
- 4. Software Bugs Are Costly <ul><li>Software is “full of bugs” </li></ul><ul><ul><li>Windows 2000, 35 million lines of code </li></ul></ul><ul><ul><ul><li>63,000 known bugs at the time of release, 2 per 1000 lines </li></ul></ul></ul><ul><li>Software failure costs </li></ul><ul><ul><li>Ariane 5 explosion due to “errors in the software of the inertial reference system” (Ariane-5 flight 501 inquiry board report http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf ) </li></ul></ul><ul><ul><li>A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually http://www.nist.gov/director/prog-ofc/report02-3.pdf </li></ul></ul><ul><li>Testing and debugging are laborious and expensive </li></ul><ul><ul><li>“50% of my company employees are testers, and the rest spends 50% of their time testing!” – Bill Gates, in 1995 </li></ul></ul>
- 5. Automated Failure Reporting <ul><li>End-users as Beta testers </li></ul><ul><ul><li>Valuable information about failure occurrences in reality </li></ul></ul><ul><ul><li>24.5 million/day in Redmond (if all users send) – John Dvorak, PC Magazine </li></ul></ul><ul><li>Widely adopted because of its usefulness </li></ul><ul><ul><li>Microsoft Windows, Linux Gentoo, Mozilla applications … </li></ul></ul><ul><ul><li>Any application can implement this functionality </li></ul></ul>
- 6. After Failures Collected …: Failure triage <ul><li>Failure triage </li></ul><ul><ul><li>Failure prioritization: </li></ul></ul><ul><ul><ul><li>What are the most severe bugs? </li></ul></ul></ul><ul><ul><li>Failure assignment: </li></ul></ul><ul><ul><ul><li>Which developers should debug a given set of failures? </li></ul></ul></ul><ul><li>Automated debugging </li></ul><ul><ul><li>Where is the likely bug location? </li></ul></ul>
- 7. A Glimpse of Software Bugs <ul><li>Crashing bugs </li></ul><ul><ul><li>Symptoms: segmentation faults </li></ul></ul><ul><ul><li>Reasons: memory access violations </li></ul></ul><ul><ul><li>Tools: Valgrind, CCured </li></ul></ul><ul><li>Noncrashing bugs </li></ul><ul><ul><li>Symptoms: unexpected outputs </li></ul></ul><ul><ul><li>Reasons: logic or semantic errors </li></ul></ul><ul><ul><ul><li>if ((m >= 0)) vs. if ((m >= 0) && (m != lastm)) </li></ul></ul></ul><ul><ul><ul><li>< vs. <=, > vs. >=, etc. </li></ul></ul></ul><ul><ul><ul><li>j = i vs. j = i+1 </li></ul></ul></ul><ul><ul><li>Tools: No sound tools </li></ul></ul>
- 8. Semantic Bugs Dominate <ul><li>Bug distribution [Li et al., ICSE’07] (chart categories: semantic bugs, memory-related bugs, concurrency bugs, others) </li></ul><ul><ul><li>264 bugs in Mozilla and 98 bugs in Apache manually checked </li></ul></ul><ul><ul><li>29,000 bugs in Bugzilla automatically checked </li></ul></ul><ul><li>Semantic bugs </li></ul><ul><ul><li>Application specific </li></ul></ul><ul><ul><li>Only a few are detectable </li></ul></ul><ul><ul><li>Mostly require annotations or specifications </li></ul></ul><ul><li>Memory-related bugs </li></ul><ul><ul><li>Many are detectable </li></ul></ul><ul><li>Courtesy of Zhenmin Li </li></ul>
- 9. Hacking Semantic Bugs is HARD <ul><li>Major challenge: No crashes! </li></ul><ul><ul><li>No failure signatures </li></ul></ul><ul><ul><li>No debugging hints </li></ul></ul><ul><li>Major Methods </li></ul><ul><ul><li>Statistical debugging of semantic bugs [Liu et al., FSE’05, TSE’06] </li></ul></ul><ul><ul><li>Triage noncrashing failures through statistical debugging [Liu et al., FSE’06] </li></ul></ul>
- 11. A Running Example <ul><li>Buggy version: void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while ((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if ((m >= 0)) { lastm = m; } if ((m == -1) || (m == i)) { i = i + 1; } else i = m; } } </li></ul><ul><li>Correct version: the first condition reads if ((m >= 0) && (lastm != m)) </li></ul><ul><li>130 of 5542 test cases fail, no crashes </li></ul><ul><li>Predicate evaluation as tossing a coin: each execution records the # of true and # of false evaluations of every instrumented predicate </li></ul><ul><ul><li>Predicates: (lin[i] != ENDSTR) == true, Ret_amatch > 0, Ret_amatch == 0, Ret_amatch < 0, (m >= 0) == true, (m == i) == true, (m >= -1) == true </li></ul></ul><ul><ul><li>[Table: # of true and # of false evaluations per predicate in one execution] </li></ul></ul>
- 12. Profile Executions as Vectors <ul><li>Extreme case </li></ul><ul><ul><li>Always false in passing and always true in failing … </li></ul></ul><ul><li>Generalized case </li></ul><ul><ul><li>Different true probability in passing and failing executions </li></ul></ul><ul><li>[Figure: two passing executions and one failing execution, each profiled as a vector of per-predicate counts: (5 1 4 2 5 1 2 4 5 1 5 1 1 5), (19 1 18 2 19 1 2 18 19 1 19 1 1 19), (9 1 8 2 2 8 2 8 9 1 9 1 1 9)] </li></ul>
- 13. Estimated Head Probability <ul><li>Evaluation bias </li></ul><ul><ul><li>Estimated head probability from every execution </li></ul></ul><ul><ul><li>Specifically, $\pi(P) = \frac{n_t}{n_t + n_f}$, where $n_t$ and $n_f$ are the numbers of true and false evaluations of predicate $P$ in one execution </li></ul></ul><ul><ul><li>Defined for each predicate and each execution </li></ul></ul>
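Evaluation bias is the fraction of a predicate's evaluations that come out true within one execution. The sketch below computes it from the two per-execution counters; the function name `evaluation_bias` is illustrative and does not come from the slides.

```c
#include <assert.h>

/* Evaluation bias of one predicate in one execution:
 * the fraction of its evaluations that came out true. */
double evaluation_bias(unsigned n_true, unsigned n_false)
{
    /* The bias is undefined when the predicate was never evaluated. */
    assert(n_true + n_false > 0);
    return (double)n_true / (double)(n_true + n_false);
}
```

For a predicate that was true 5 times and false once, `evaluation_bias(5, 1)` is 5/6.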
- 14. Divergence in Head Probability <ul><li>Multiple evaluation biases from multiple executions </li></ul><ul><li>Evaluation bias as generated from models </li></ul><ul><li>[Figure: two plots of Prob against Head Probability (0 to 1)] </li></ul>
- 15. Major Challenges <ul><li>No closed form for either model </li></ul><ul><li>Insufficient number of failing executions for estimation </li></ul><ul><li>[Figure: two plots of Prob against Head Probability (0 to 1)] </li></ul>
- 17. SOBER in Summary <ul><li>[Figure: SOBER takes the source code and a test suite as input and outputs a ranking of predicates, e.g. Pred 2, Pred 6, Pred 1, Pred 3] </li></ul>
- 18. Previous State of the Art [Liblit et al, 2005] <ul><li>Correlation analysis </li></ul><ul><ul><li>Context( P ) = Prob(fail | P ever evaluated) </li></ul></ul><ul><ul><li>Failure( P ) = Prob(fail | P ever evaluated as true ) </li></ul></ul><ul><ul><li>Increase( P ) = Failure( P ) – Context( P ) </li></ul></ul><ul><li>How much more likely the program is to fail when a predicate is ever evaluated true </li></ul>
- 19. Liblit05 in Illustration <ul><li>Context(P) = Prob(fail | P ever evaluated) = 4/10 = 2/5 </li></ul><ul><li>Failure(P) = Prob(fail | P ever evaluated as true) = 3/7 </li></ul><ul><li>Increase(P) = Failure(P) – Context(P) = 3/7 – 2/5 = 1/35 </li></ul><ul><li>[Figure: executions marked + and O, grouped into failing and passing] </li></ul>
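The three Liblit05 scores can be reproduced from four run counts per predicate. This is a minimal sketch; the struct layout and function names are assumptions made for illustration and are not taken from the Liblit05 implementation.

```c
#include <assert.h>

/* Run counts for one predicate P (field names are illustrative). */
struct pred_counts {
    unsigned fail_observed; /* failing runs in which P was evaluated at all */
    unsigned pass_observed; /* passing runs in which P was evaluated at all */
    unsigned fail_true;     /* failing runs in which P was true at least once */
    unsigned pass_true;     /* passing runs in which P was true at least once */
};

/* Context(P) = Prob(fail | P ever evaluated) */
double context_score(const struct pred_counts *c)
{
    unsigned observed = c->fail_observed + c->pass_observed;
    assert(observed > 0);
    return (double)c->fail_observed / (double)observed;
}

/* Failure(P) = Prob(fail | P ever evaluated as true) */
double failure_score(const struct pred_counts *c)
{
    unsigned ever_true = c->fail_true + c->pass_true;
    assert(ever_true > 0);
    return (double)c->fail_true / (double)ever_true;
}

/* Increase(P) = Failure(P) - Context(P) */
double increase_score(const struct pred_counts *c)
{
    return failure_score(c) - context_score(c);
}
```

With the counts from the illustration (10 runs evaluate P, 4 of them fail; 7 runs see P true, 3 of them fail), the scores are 4/10, 3/7, and 1/35.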
- 20. SOBER in Illustration <ul><li>[Figure: failing and passing executions (marked + and O), with one plot of Prob against Evaluation bias (0 to 1) for each group] </li></ul>
- 21. Difference between SOBER and Liblit05 <ul><li>Methodology: </li></ul><ul><ul><li>Liblit05: Correlation analysis </li></ul></ul><ul><ul><li>SOBER: Model-based approach </li></ul></ul><ul><li>Utilized information </li></ul><ul><ul><li>Liblit05: Ever true? </li></ul></ul><ul><ul><li>SOBER: What percentage is true? </li></ul></ul><ul><li>Example (line numbers as on the slide): </li></ul><ul><li>void subline(char *lin, char *pat, char *sub) </li></ul><ul><li>{ </li></ul><ul><li>1 int i, lastm, m; </li></ul><ul><li>2 lastm = -1; </li></ul><ul><li>3 i = 0; </li></ul><ul><li>4 while((lin[i] != ENDSTR)) { </li></ul><ul><li>5 m = amatch(lin, i, pat, 0); </li></ul><ul><li>6 if (m >= 0){ </li></ul><ul><li>7 putsub(lin, i, m, sub); </li></ul><ul><li>8 lastm = m; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>11 } </li></ul><ul><li>Liblit05: Line 6 is ever true in most passing and failing executions </li></ul><ul><li>SOBER: Line 6 is prone to be true in failing executions and prone to be false in passing executions </li></ul>
- 22. T-Score: Metric of Debugging Quality <ul><li>How close is the blamed location to the real bug location? </li></ul>void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if ((m >= 0)){ lastm = m; } if ((m == -1) || (m == i)) { i = i + 1; } else i = m; } } <ul><li>T-score = 70% </li></ul>
- 23. A Better Debugging Result void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if ((m >= 0)){ lastm = m; } if ((m == -1) || (m == i)) { i = i + 1; } else i = m; } } T-score = 40%
- 24. Evaluation 1: Siemens Program Suite <ul><li>Siemens program suite </li></ul><ul><ul><li>130 buggy versions of 7 small (<700 LOC) programs </li></ul></ul><ul><li>What percentage of bugs can be located by examining no more than a given percentage of the code? </li></ul><ul><li>A T-Score <= 20% is considered meaningful </li></ul>
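- 25. Evaluation 2: Reasonably Large Programs <ul><li>Software-artifact Infrastructure Repository (SIR): http://sir.unl.edu </li></ul><table><tr><th>Program</th><th>Bug</th><th>Bug Type</th><th>Failure Number</th><th>T-Score</th></tr><tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 1</td><td>Misuse of >= for ></td><td>163/525</td><td>0.5%</td></tr><tr><td></td><td>Bug 2</td><td>Misuse of = for ==</td><td>356/525</td><td>1.6%</td></tr><tr><td></td><td>Bug 3</td><td>Mis-assign value true for false</td><td>69/525</td><td>7.6%</td></tr><tr><td></td><td>Bug 4</td><td>Mis-parenthesize ((a||b)&&c) as (a || (b && c))</td><td>22/525</td><td>15.4%</td></tr><tr><td></td><td>Bug 5</td><td>Off-by-one</td><td>92/525</td><td>45.6%</td></tr><tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 1</td><td>Off-by-one</td><td>48/470</td><td>0.6%</td></tr><tr><td></td><td>Bug 2</td><td>Subclause-missing</td><td>88/470</td><td>0.2%</td></tr><tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 1</td><td>Subclause-missing</td><td>65/217</td><td>0.5%</td></tr><tr><td></td><td>Bug 2</td><td>Subclause-missing</td><td>17/217</td><td>2.9%</td></tr></table>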
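The evaluation question "what percentage of bugs can be located within a given code-examination budget" reduces to counting the bugs whose T-score is at or below the budget. A minimal sketch follows; the function name `percent_located` is hypothetical, and the sample data in the usage note are the nine T-scores from the table on this slide.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of bugs whose T-score (in percent of code examined)
 * does not exceed the given budget (also in percent). */
double percent_located(const double *t_scores, size_t n, double budget)
{
    assert(n > 0);
    size_t located = 0;
    for (size_t i = 0; i < n; i++) {
        if (t_scores[i] <= budget)
            located++;
    }
    return 100.0 * (double)located / (double)n;
}
```

Applied to the T-scores {0.5, 1.6, 7.6, 15.4, 45.6, 0.6, 0.2, 0.5, 2.9} with a 20% budget, 8 of the 9 bugs count as located; only Flex Bug 5 (45.6%) falls outside.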
- 26. A Glimpse of Bugs in Flex-2.4.7
- 28. A Close Look: Grep-2.2: Bug 1 <ul><li>11,826 lines of C code </li></ul><ul><li>3,136 predicates instrumented </li></ul><ul><li>48 out of 470 cases fail </li></ul>
- 29. Grep-2.2: Bug 2 <ul><li>11,826 lines of C code </li></ul><ul><li>3,136 predicates instrumented </li></ul><ul><li>88 out of 470 cases fail </li></ul>
- 30. No Silver Bullet: Flex Bug 5 <ul><li>8,834 lines of C code </li></ul><ul><li>2,699 predicates instrumented </li></ul><ul><li>No wrong value in chk[offset - 1] </li></ul><ul><li>chk[offset] is not used here but later </li></ul>
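- 31. Experiment Result in Summary <ul><li>Effective for bugs demonstrating abnormal control flows </li></ul><table><tr><th>Program</th><th>Bug</th><th>Bug Type</th><th>Failure Number</th><th>T-Score</th></tr><tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 1</td><td>Misuse of >= for ></td><td>163/525</td><td>0.5%</td></tr><tr><td></td><td>Bug 2</td><td>Misuse of = for ==</td><td>356/525</td><td>1.6%</td></tr><tr><td></td><td>Bug 3</td><td>Mis-assign value true for false</td><td>69/525</td><td>7.6%</td></tr><tr><td></td><td>Bug 4</td><td>Mis-parenthesize ((a||b)&&c) as (a || (b && c))</td><td>22/525</td><td>15.4%</td></tr><tr><td></td><td>Bug 5</td><td>Off-by-one</td><td>92/525</td><td>45.6%</td></tr><tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 1</td><td>Off-by-one</td><td>48/470</td><td>0.6%</td></tr><tr><td></td><td>Bug 2</td><td>Subclause-missing</td><td>88/470</td><td>0.2%</td></tr><tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 1</td><td>Subclause-missing</td><td>65/217</td><td>0.5%</td></tr><tr><td></td><td>Bug 2</td><td>Subclause-missing</td><td>17/217</td><td>2.9%</td></tr></table>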
- 32. SOBER Handles Memory Bugs As Well <ul><li>bc 1.06: </li></ul><ul><li>Two memory bugs found with SOBER </li></ul><ul><li>One of them is unreported </li></ul><ul><li>Blamed location is NOT the crashing venue </li></ul>
- 34. Major Problems in Failure Triage <ul><li>Failure Prioritization </li></ul><ul><ul><li>What failures are likely due to the same bug </li></ul></ul><ul><ul><li>What bugs are the most severe </li></ul></ul><ul><ul><li>Worst 1% bugs = 50% failures </li></ul></ul><ul><li>Failure Assignment </li></ul><ul><ul><li>Which developer should debug which set of failures? </li></ul></ul>Courtesy of Microsoft Corporation
- 35. A Solution: Failure Clustering <ul><li>Failure indexing </li></ul><ul><ul><li>Identify failures likely due to the same bug </li></ul></ul><ul><li>[Figure: failure reports (+) plotted on X and Y axes form clusters labeled Most Severe, Less Severe, and Least Severe, annotated with candidate locations "Fault in core.io?" and "Fault in function initialize()?"] </li></ul>
- 36. The Central Question: A Distance Measure between Failures <ul><li>Different measures render different clusterings </li></ul><ul><li>[Figure: the same points (+ and O) on X and Y axes, clustered once with distance defined on the X-axis and once with distance defined on the Y-axis] </li></ul>
- 37. How to Define a Distance <ul><li>Previous work [Podgurski et al., 2003] </li></ul><ul><ul><li>T-Proximity : Distance defined on literal trace similarity </li></ul></ul><ul><li>Our approach [Liu et al., 2006] </li></ul><ul><ul><li>R-Proximity : Distance defined on likely bug location (as given by SOBER) </li></ul></ul>
- 38. Why Our Approach is Reasonable <ul><li>Optimal proximity: defined on root causes ( RC ) </li></ul><ul><li>Our approach: defined on likely causes ( LC ), obtained by automated fault localization </li></ul><ul><li>[Figure: failing (F) and passing (P) executions mapped to points (+) on X and Y axes] </li></ul>
- 39. R-Proximity: An Instantiation with SOBER <ul><li>Likely causes (LCs) are predicate rankings </li></ul><ul><li>A distance between rankings is needed </li></ul><ul><li>[Figure: from failing (F) and passing (P) executions, SOBER produces predicate rankings such as (Pred 2, Pred 3, Pred 1, Pred 6) and (Pred 2, Pred 6, Pred 1, Pred 3), which become points (+) on X and Y axes] </li></ul>
- 40. Distance between Rankings <ul><li>Traditional Kendall’s tau distance </li></ul><ul><ul><li>Number of preference disagreements </li></ul></ul><ul><ul><li>E.g. </li></ul></ul><ul><li>NOT all predicates need to be considered? </li></ul><ul><ul><li>Predicates are uniformly instrumented </li></ul></ul><ul><ul><li>Only fault-relevant predicates count </li></ul></ul>
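Kendall's tau distance counts the pairs of predicates that two rankings order differently. The sketch below is the plain, unweighted form and assumes both rankings list the same distinct predicate ids; the function names are hypothetical, and the fault-relevance weighting that R-Proximity adds is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Index of predicate `id` in `ranking`, or n if it is absent. */
static size_t position_of(const int *ranking, size_t n, int id)
{
    for (size_t i = 0; i < n; i++) {
        if (ranking[i] == id)
            return i;
    }
    return n;
}

/* Kendall's tau distance: the number of predicate pairs that rankings
 * `a` and `b` order differently. Both must rank the same n distinct ids. */
size_t kendall_tau_distance(const int *a, const int *b, size_t n)
{
    size_t disagreements = 0;
    for (size_t i = 0; i < n; i++) {
        size_t pi = position_of(b, n, a[i]);
        assert(pi < n);
        for (size_t j = i + 1; j < n; j++) {
            size_t pj = position_of(b, n, a[j]);
            assert(pj < n);
            /* `a` ranks a[i] above a[j]; count the pair if `b` reverses it. */
            if (pi > pj)
                disagreements++;
        }
    }
    return disagreements;
}
```

For the rankings (Pred 2, Pred 6, Pred 1, Pred 3) and (Pred 2, Pred 1, Pred 3, Pred 6), the two rankings disagree on the pairs (6, 1) and (6, 3), so the distance is 2.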
- 41. Predicate Weighting in a Nutshell <ul><li>Fault-relevant predicates receive higher weights </li></ul><ul><li>Fault-relevance is implied by rankings </li></ul><ul><li>Most-favored predicates receive higher weights </li></ul><ul><li>[Figure: example rankings (Pred 2, Pred 6, Pred 1, Pred 3) and (Pred 2, Pred 1, Pred 3, Pred 6)] </li></ul>
- 42. Automated Failure Assignment <ul><li>Most-favored predicates indicate the agreed bug location for a group of failures </li></ul><ul><li>Predicate spectrum graph </li></ul><ul><li>[Figure: predicate spectrum graph with predicate index on the x-axis, built from rankings such as (Pred 2, Pred 6, Pred 1, Pred 3) and (Pred 2, Pred 1, Pred 3, Pred 6)] </li></ul>
- 43. Case Study 1: Grep-2.2 <ul><li>470 test cases in total </li></ul><ul><li>136 cases fail in total across the two faults, no crashes </li></ul><ul><li>48 fail due to Fault 1, 88 fail due to Fault 2 </li></ul>
- 44. Failure Proximity Graphs <ul><li>Red crosses are failures due to Fault 1 </li></ul><ul><li>Blue circles are failures due to Fault 2 </li></ul><ul><li>Divergent behaviors due to the same fault </li></ul><ul><li>Better clustering result under R-Proximity </li></ul>T-Proximity R-Proximity
- 45. Guided Failure Assignment <ul><li>What predicates are favored in each group? </li></ul>
- 46. Assign Failures to Appropriate Developers <ul><li>The 21 failing cases in Cluster 1 are assigned to developers responsible for the function grep </li></ul><ul><li>The 112 failing cases in Cluster 2 are assigned to developers responsible for the function comsub </li></ul>
- 47. Case Study 2: Gzip-1.2.3 <ul><li>217 test cases in total </li></ul><ul><li>82 cases fail in total across the two faults, no crashes </li></ul><ul><li>65 fail due to Fault 1, 17 fail due to Fault 2 </li></ul>
- 48. Failure Proximity Graphs <ul><li>Red crosses are for failures due to Fault 1 </li></ul><ul><li>Blue circles are for failures due to Fault 2 </li></ul><ul><li>Nearly perfect clustering under R-Proximity </li></ul><ul><li>Accurate failure assignment </li></ul>T-Proximity R-Proximity
- 50. Mining Copy-Paste Bugs <ul><li>Copy-pasting is common </li></ul><ul><ul><li>12% in Linux file system [Kasper2003] </li></ul></ul><ul><ul><li>19% in X Window system [Baker1995] </li></ul></ul><ul><li>Copy-pasted code is error prone </li></ul><ul><ul><li>Among 35 errors in Linux drivers/i2o, 34 are caused by copy-paste [Chou2001] </li></ul></ul><ul><li>Example (simplified from linux-2.6.6/arch/sparc/prom/memory.c ): void __init prom_meminit(void) { …… for (i=0; i<n; i++) { total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1]; } …… for (i=0; i<n; i++) { taken[i].adr = list[i].addr; taken[i].bytes = list[i].size; taken[i].more = &total[i+1]; } …… } </li></ul><ul><li>Forget to change! The pasted loop still takes &total[i+1] in its last statement, where taken was intended </li></ul>
- 51. An Overview of Copy-Paste Bug Detection <ul><li>Parse source code & build a sequence database </li></ul><ul><li>Mine for basic copy-pasted segments </li></ul><ul><li>Compose larger copy-pasted segments </li></ul><ul><li>Prune false positives </li></ul>
- 52. Parsing Source Code <ul><li>Purpose: building a sequence database </li></ul><ul><li>Idea: statement → number </li></ul><ul><ul><li>Tokenize each component </li></ul></ul><ul><ul><li>Different operators/constants/keywords → different tokens </li></ul></ul><ul><li>Handle identifier renaming: </li></ul><ul><ul><li>Same type of identifiers → same token </li></ul></ul><ul><li>Example: both old = 3; and new = 3; tokenize to 5 61 20 and hash to 16 </li></ul>
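The statement-to-number idea can be sketched as a tokenizer that replaces every identifier with one shared token and then hashes the token stream. Everything concrete below is an assumption for illustration: the token codes, the small keyword list, and the FNV-1a style mixing step are not CP-Miner's, and the hash values will not match the 5 61 20 / 16 values shown on the slide.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define ID_TOKEN 1u /* hypothetical code shared by all identifiers */

/* Tiny illustrative keyword list; keywords must not collapse to ID_TOKEN. */
static int is_keyword(const char *s, size_t len)
{
    static const char *const kw[] = { "if", "else", "for", "while", "return" };
    for (size_t k = 0; k < sizeof kw / sizeof kw[0]; k++) {
        if (strlen(kw[k]) == len && strncmp(kw[k], s, len) == 0)
            return 1;
    }
    return 0;
}

/* Mix one token code into the running hash (FNV-1a style step). */
static unsigned mix(unsigned h, unsigned token)
{
    return (h ^ token) * 16777619u;
}

/* Hash a statement so that renaming identifiers leaves the hash unchanged,
 * while operators, constants, and keywords still influence it. */
unsigned hash_statement(const char *stmt)
{
    assert(stmt != NULL);
    unsigned h = 2166136261u;
    size_t i = 0;
    while (stmt[i] != '\0') {
        unsigned char c = (unsigned char)stmt[i];
        if (isspace(c)) {
            i++; /* whitespace carries no token */
        } else if (isalpha(c) || c == '_') {
            size_t start = i;
            while (isalnum((unsigned char)stmt[i]) || stmt[i] == '_')
                i++;
            if (is_keyword(stmt + start, i - start)) {
                /* Keywords keep their spelling: hash each character. */
                for (size_t k = start; k < i; k++)
                    h = mix(h, 256u + (unsigned char)stmt[k]);
            } else {
                h = mix(h, ID_TOKEN); /* any identifier, same token */
            }
        } else {
            h = mix(h, 256u + c); /* operator, digit, or punctuation */
            i++;
        }
    }
    return h;
}
```

With this sketch, `old = 3;` and `new = 3;` hash to the same value, as do `total[i].adr = list[i].addr;` and `taken[i].adr = list[i].addr;`, while changing the operator or the constant changes the hash.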
- 53. Building Sequence Database <ul><li>Program → a long sequence </li></ul><ul><ul><li>Need a sequence database </li></ul></ul><ul><li>Cut the long sequence </li></ul><ul><ul><li>Naïve method: fixed length </li></ul></ul><ul><ul><li>Our method: basic block </li></ul></ul><ul><li>Example: for (i=0; i<n; i++) { total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1]; } …… for (i=0; i<n; i++) { taken[i].adr = list[i].addr; taken[i].bytes = list[i].size; taken[i].more = &total[i+1]; } </li></ul><ul><ul><li>Hash values: 65 16 16 71 … 65 16 16 71 </li></ul></ul><ul><ul><li>Final sequence DB: (65) (16, 16, 71) … (65) (16, 16, 71) </li></ul></ul>
- 54. Mining for Basic Copy-pasted Segments <ul><li>Apply frequent sequence mining algorithm on the sequence database </li></ul><ul><li>Modification </li></ul><ul><ul><li>Constrain the max gap </li></ul></ul><ul><li>Example: total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1]; and taken[i].adr = list[i].addr; taken[i].bytes = list[i].size; taken[i].more = &total[i+1]; </li></ul><ul><ul><li>Both blocks hash to (16, 16, 71), a frequent subsequence </li></ul></ul><ul><ul><li>Inserting 1 statement into one copy gives (16, 16, 10, 71), which still contains (16, 16, 71) with gap = 1 </li></ul></ul>
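The max-gap constraint can be illustrated with a support check: does a candidate pattern occur in a block's hash sequence when at most `max_gap` elements may be skipped between consecutive matches? This is only the matching step, written as a simple backtracking search; the function names are hypothetical and the full frequent-sequence mining algorithm is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Pattern elements pat[0..k-1] are matched, the last one at seq[pos].
 * Try to match the remaining elements without exceeding max_gap. */
static int match_from(const int *seq, size_t n, size_t pos,
                      const int *pat, size_t m, size_t k, size_t max_gap)
{
    if (k == m)
        return 1;
    for (size_t next = pos + 1; next < n && next - pos - 1 <= max_gap; next++) {
        if (seq[next] == pat[k] &&
            match_from(seq, n, next, pat, m, k + 1, max_gap))
            return 1;
    }
    return 0;
}

/* Returns 1 if `pat` occurs in `seq` as a subsequence in which at most
 * max_gap elements are skipped between consecutive matched elements. */
int occurs_with_max_gap(const int *seq, size_t n,
                        const int *pat, size_t m, size_t max_gap)
{
    assert(m > 0);
    for (size_t start = 0; start < n; start++) {
        if (seq[start] == pat[0] &&
            match_from(seq, n, start, pat, m, 1, max_gap))
            return 1;
    }
    return 0;
}
```

With the slide's example, the pattern (16, 16, 71) is found in (16, 16, 10, 71) when `max_gap` is 1, and is rejected when `max_gap` is 0.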
- 55. Composing Larger Copy-Pasted Segments <ul><li>Combine the neighboring copy-pasted segments repeatedly </li></ul><ul><li>Example: in for (i=0; i<n; i++) { total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1]; } and its pasted copy on taken, the loop header (hash 65) and the loop body (hashes 16 16 71) are each copy-pasted; being neighbors in both copies, they are combined into one larger segment 65 16 16 71 </li></ul>
- 56. Pruning False Positives <ul><li>Unmappable segments </li></ul><ul><ul><li>Identifier names cannot be mapped to corresponding ones </li></ul></ul><ul><ul><li>Example: f(a1); f(a2); f(a3); versus f1(b1); f1(b2); f2(b3); here f would have to map to both f1 and f2, which is a conflict </li></ul></ul><ul><li>Tiny segments </li></ul><ul><li>For more detail, see </li></ul><ul><ul><li>Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, in Proc. 6th Symp. Operating Systems Design and Implementation , 2004 </li></ul></ul>
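The unmappable-segment check can be sketched as a consistency test on identifier pairs: walking two candidate segments position by position, each identifier of the first must always pair with the same identifier of the second. The function name and the position-by-position pairing are assumptions for illustration; the actual pruning rule in CP-Miner may be more lenient.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* a[i] and b[i] are the identifiers at the same position in two candidate
 * copy-pasted segments. Returns 1 if every identifier in `a` always pairs
 * with the same identifier in `b`, and 0 on the first conflict. */
int identifiers_mappable(const char *const *a, const char *const *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(a[i] != NULL && b[i] != NULL);
        for (size_t j = 0; j < i; j++) {
            /* Same source identifier but a different partner: conflict. */
            if (strcmp(a[i], a[j]) == 0 && strcmp(b[i], b[j]) != 0)
                return 0;
        }
    }
    return 1;
}
```

For the slide's example, pairing (f, a1, f, a2, f, a3) with (f1, b1, f1, b2, f2, b3) fails because `f` is paired with both `f1` and `f2`.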
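- 57. Some Test Results of C-P Bug Detection <table><tr><th>Software</th><th># LOC</th><th>Verified Bugs</th><th>Potential Bugs (careless programming)</th><th>Time</th><th>Space (MB)</th></tr><tr><td>Linux</td><td>4.4 M</td><td>28</td><td>21</td><td>20 mins</td><td>527</td></tr><tr><td>FreeBSD</td><td>3.3 M</td><td>23</td><td>8</td><td>20 mins</td><td>459</td></tr><tr><td>Apache</td><td>224 K</td><td>5</td><td>0</td><td>15 secs</td><td>30</td></tr><tr><td>PostgreSQL</td><td>458 K</td><td>2</td><td>0</td><td>38 secs</td><td>57</td></tr></table>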
- 59. Conclusions <ul><li>Data mining into software and computer systems </li></ul><ul><li>Identify incorrect executions from program runtime behaviors </li></ul><ul><li>Classification dynamics can give away “backtrace” for noncrashing bugs without any semantic inputs </li></ul><ul><li>A hypothesis testing-like approach is developed to localize logic bugs in software </li></ul><ul><li>No prior knowledge about the program semantics is assumed </li></ul><ul><li>Many other software bug mining methods remain to be explored </li></ul>
- 60. Future Research: Mining into Computer Systems <ul><li>Huge volume of data from computer systems </li></ul><ul><ul><li>Persistent state interactions, event logs, network logs, CPU usage, … </li></ul></ul><ul><li>Mining system data for … </li></ul><ul><ul><li>Reliability </li></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>Manageability </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Challenges in data mining </li></ul><ul><ul><li>Statistical modeling of computer systems </li></ul></ul><ul><ul><li>Online, scalability, interpretability … </li></ul></ul>
- 61. References <ul><li>[DRL+98] David L. Detlefs, K. Rustan M. Leino, Greg Nelson, and James B. Saxe. Extended static checking, 1998. </li></ul><ul><li>[EGH+94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint: A tool for using specifications to check code. In Proceedings of the ACM SIGSOFT '94 Symposium on the Foundations of Software Engineering, pages 87-96, 1994. </li></ul><ul><li>[DLS02] Manuvir Das, Sorin Lerner, and Mark Seigle. ESP: Path-sensitive program verification in polynomial time. In Conference on Programming Language Design and Implementation, 2002. </li></ul><ul><li>[ECC00] D.R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. Operating Systems Design and Implementation, October 2000. </li></ul><ul><li>[M93] Ken McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993. </li></ul><ul><li>[H97] Gerard J. Holzmann. The model checker SPIN. Software Engineering, 23(5):279-295, 1997. </li></ul><ul><li>[DDH+92] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol verification as a hardware design aid. In IEEE Int. Conf. Computer Design: VLSI in Computers and Processors, pages 522-525, 1992. </li></ul><ul><li>[MPC+02] M. Musuvathi, D. Y.W. Park, A. Chou, D. R. Engler and D. L. Dill. CMC: A Pragmatic Approach to Model Checking Real Code. In Proc. 5th Symp. Operating Systems Design and Implementation, 2002. </li></ul>
- 62. References (cont’d) <ul><li>[G97] P. Godefroid. Model Checking for Programming Languages using VeriSoft. In Proc. 24th ACM Symp. Principles of Programming Languages, 1997. </li></ul><ul><li>[BHP+-00] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE Int’l Conf. Automated Software Engineering (ASE), 2000. </li></ul><ul><li>[HJ92] R. Hastings and B. Joyce. Purify: Fast Detection of Memory Leaks and Access Errors. In Proc. Winter 1992 USENIX Conference, pp. 125-138, San Francisco, California, 1991. </li></ul><ul><li>Chao Liu, Xifeng Yan, and Jiawei Han, “Mining Control Flow Abnormality for Logic Error Isolation,” in Proc. 2006 SIAM Int. Conf. on Data Mining (SDM'06), Bethesda, MD, April 2006. </li></ul><ul><li>C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, “SOBER: Statistical Model-based Bug Localization,” in Proc. 2005 ACM SIGSOFT Symp. Foundations of Software Engineering (FSE 2005), Lisbon, Portugal, Sept. 2005. </li></ul><ul><li>C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for Backtrace of Noncrashing Bugs,” in Proc. 2005 SIAM Int. Conf. on Data Mining (SDM'05), Newport Beach, CA, April 2005. </li></ul><ul><li>[SN00] Julian Seward and Nick Nethercote. Valgrind, an open-source memory debugger for x86-GNU/Linux. http://valgrind.org/ </li></ul><ul><li>[LLM+04] Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004. </li></ul><ul><li>[LCS+04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, Yuanyuan Zhou. C-Miner: Mining Block Correlations in Storage Systems. In Proc. 3rd USENIX Conf. on File and Storage Technologies, 2004. </li></ul>
Representative Publications

Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, "Statistical Debugging: A Hypothesis Testing-Based Approach," IEEE Transactions on Software Engineering, Vol. 32, No. 10, pp. 831-848, Oct. 2006.

Chao Liu and Jiawei Han, "R-Proximity: Failure Proximity Defined via Statistical Debugging," IEEE Transactions on Software Engineering, Sept. 2006. (under review)

Chao Liu, Zeng Lian and Jiawei Han, "How Bayesians Debug", the 6th IEEE International Conference on Data Mining, pp. 382-393, Hong Kong, China, Dec. 2006.

Chao Liu and Jiawei Han, "Failure Proximity: A Fault Localization-Based Approach", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Portland, USA, Nov. 2006.

Chao Liu, "Fault-aware Fingerprinting: Towards Mutualism between Failure Investigation and Statistical Debugging", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Portland, USA, Nov. 2006.

Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis", the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 872-881, Philadelphia, USA, Aug. 2006.

Qiaozhu Mei, Chao Liu, Hang Su and Chengxiang Zhai, "A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs", the 15th International Conference on World Wide Web, pp. 533-542, Edinburgh, Scotland, May 2006.

Chao Liu, Xifeng Yan and Jiawei Han, "Mining Control Flow Abnormality for Logic Error Isolation", 2006 SIAM International Conference on Data Mining, pp. 106-117, Bethesda, US, April 2006.

Chao Liu, Xifeng Yan, Long Fei, Jiawei Han and Samuel Midkiff, "SOBER: Statistical Model-Based Bug Localization", the 5th joint meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Lisbon, Portugal, Sept. 2005.

William Yurcik and Chao Liu, "A First Step Toward Detecting SSH Identity Theft on HPC Clusters: Discriminating Cluster Masqueraders Based on Command Behavior", the 5th International Symposium on Cluster Computing and the Grid, pp. 111-120, Cardiff, UK, May 2005.

Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han and Philip S. Yu, "Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs", In Proc. 2005 SIAM Int. Conf. on Data Mining, pp. 286-297, Newport Beach, US, April 2005.
- 66. Example of Noncrashing Bugs <ul><li>void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if (m >= 0){ putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } } </li></ul><ul><li>Variants of the first condition shown on the slide: if (m > 0), if (m >= 0), if ((m >= 0) && (lastm != m)) </li></ul>
- 67. Debugging Crashes Crashing Bugs
- 68. Bug Localization via Backtrace <ul><li>Can we circle out the backtrace for noncrashing bugs? </li></ul><ul><li>Major challenges </li></ul><ul><ul><li>We do not know where abnormality happens </li></ul></ul><ul><li>Observations </li></ul><ul><ul><li>Classifications depend on discriminative features, which can be regarded as a kind of abnormality </li></ul></ul><ul><ul><li>Can we extract backtrace from classification results? </li></ul></ul>
- 69. Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
- 70. Related Work <ul><li>Crashing bugs </li></ul><ul><ul><li>Memory access monitoring </li></ul></ul><ul><ul><ul><li>Purify [HJ92], Valgrind [SN00] … </li></ul></ul></ul><ul><li>Noncrashing bugs </li></ul><ul><ul><li>Static program analysis </li></ul></ul><ul><ul><li>Traditional model checking </li></ul></ul><ul><ul><li>Model checking source code </li></ul></ul>
- 71. Static Program Analysis <ul><li>Methodology </li></ul><ul><ul><li>Examine source code directly </li></ul></ul><ul><ul><li>Enumerate all the possible execution paths without running the program </li></ul></ul><ul><ul><li>Check user-specified properties, e.g. </li></ul></ul><ul><ul><ul><li>free(p) …… (*p) </li></ul></ul></ul><ul><ul><ul><li>lock(res) …… unlock(res) </li></ul></ul></ul><ul><ul><ul><li>receive_ack() …… send_data() </li></ul></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Check all possible execution paths </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Shallow semantics </li></ul></ul><ul><ul><li>Properties must be directly mappable to source code structure </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>ESC [DRL+98], LCLint [EGH+94], ESP [DLS02], MC Checker [ECC00] … </li></ul></ul>
- 72. Traditional Model Checking <ul><li>Methodology </li></ul><ul><ul><li>Formally model the system under check in a particular description language </li></ul></ul><ul><ul><li>Exhaustively explore the reachable states to check desired or undesired properties </li></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Model deep semantics </li></ul></ul><ul><ul><li>Naturally fits checking event-driven systems, like protocols </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Significant manual effort in modeling </li></ul></ul><ul><ul><li>State space explosion </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>SMV [M93], SPIN [H97], Murphi [DDH+92] … </li></ul></ul>
- 73. Model Checking Source Code <ul><li>Methodology </li></ul><ul><ul><li>Run the real program in a sandbox </li></ul></ul><ul><ul><li>Manipulate the occurrence of events, e.g., </li></ul></ul><ul><ul><ul><li>Incoming messages </li></ul></ul></ul><ul><ul><ul><li>The outcomes of memory allocation </li></ul></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Less manual specification required </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Application restrictions, e.g., </li></ul></ul><ul><ul><ul><li>Event-driven programs (still) </li></ul></ul></ul><ul><ul><ul><li>Clear mapping between source code and logic event </li></ul></ul></ul><ul><li>Tools </li></ul><ul><ul><li>CMC [MPC+02], Verisoft [G97], Java PathFinder [BHP+-00] … </li></ul></ul>
- 74. Summary of Related Work <ul><li>In common, </li></ul><ul><ul><li>Semantic inputs are necessary </li></ul></ul><ul><ul><ul><li>Program model </li></ul></ul></ul><ul><ul><ul><li>Properties to check </li></ul></ul></ul><ul><ul><li>Application scenarios </li></ul></ul><ul><ul><ul><li>Shallow semantics </li></ul></ul></ul><ul><ul><ul><li>Event-driven system </li></ul></ul></ul>
- 75. Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
- 76. Example Revisited void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if (m > 0){ putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } } void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if (m >= 0){ putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } } void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if ( (m >= 0) && (lastm != m) ) { putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } } <ul><li>No memory violations </li></ul><ul><li>Not event-driven program </li></ul><ul><li>No explicit error properties </li></ul>
- 77. Identification of Incorrect Executions <ul><li>A two-class classification problem </li></ul><ul><ul><li>How to abstract program executions </li></ul></ul><ul><ul><ul><li>Program behavior graph </li></ul></ul></ul><ul><ul><li>Feature selection </li></ul></ul><ul><ul><ul><li>Edges + Closed frequent subgraphs </li></ul></ul></ul><ul><li>Program behavior graphs </li></ul><ul><ul><li>Function-level abstraction of program behaviors </li></ul></ul>int main(){ ... A(); ... B(); } int A(){ ... } int B(){ ... C() ... } int C(){ ... }
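To make the function-level abstraction concrete, here is a minimal sketch of a behavior graph for the toy program on this slide (main calls A and B, B calls C). It covers only the edge features; closed frequent subgraph mining is not shown. The type `BehaviorGraph`, the enum constants, and the helper functions are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_FUNCS 4 /* main, A, B, C from the slide's toy program */

enum { FN_MAIN, FN_A, FN_B, FN_C };

/* Behavior graph of one run: edge[caller][callee] is true if that call occurred. */
typedef struct {
    bool edge[MAX_FUNCS][MAX_FUNCS];
} BehaviorGraph;

/* Start with an empty graph (no calls observed yet). */
void graph_init(BehaviorGraph *g) {
    memset(g, 0, sizeof *g);
}

/* Record that `caller` invoked `callee` at least once during the run. */
void graph_record_call(BehaviorGraph *g, int caller, int callee) {
    g->edge[caller][callee] = true;
}

/* Flatten the graph into a 0/1 feature vector, one feature per possible edge. */
void graph_to_features(const BehaviorGraph *g, int features[MAX_FUNCS * MAX_FUNCS]) {
    for (int i = 0; i < MAX_FUNCS; i++) {
        for (int j = 0; j < MAX_FUNCS; j++) {
            features[i * MAX_FUNCS + j] = g->edge[i][j] ? 1 : 0;
        }
    }
}
```

Each execution would yield one such vector, labeled correct or incorrect, which is the input a two-class classifier needs.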
- 78. Values of Classification <ul><li>A graph classification problem </li></ul><ul><ul><li>Every execution gives one behavior graph </li></ul></ul><ul><ul><li>Two sets of instances: correct and incorrect </li></ul></ul><ul><li>Values of classification </li></ul><ul><ul><li>Classification itself does not readily work for bug localization </li></ul></ul><ul><ul><ul><li>The classifier only labels each run as a whole as either correct or incorrect </li></ul></ul></ul><ul><ul><ul><li>It does not tell when the abnormality happens </li></ul></ul></ul><ul><ul><li>Successful classification relies on discriminative features </li></ul></ul><ul><ul><ul><li>Can discriminative features be treated as a kind of abnormality? </li></ul></ul></ul><ul><ul><li>When does the abnormality happen? </li></ul></ul><ul><ul><ul><li>Incremental classification? </li></ul></ul></ul>
- 79. Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
- 80. Incremental Classification <ul><li>Classification works only when instances of the two classes are different </li></ul><ul><li>Classification accuracy can therefore be used as a measure of that difference </li></ul><ul><li>Relate classification dynamics to bug-relevant functions </li></ul>
- 81. Illustration: Precision Boost [Figure: call graphs of one correct execution and one incorrect execution, with function nodes main and A through H]
- 82. Bug Relevance <ul><li>Precision boost </li></ul><ul><ul><li>For each function F : </li></ul></ul><ul><ul><ul><li>Precision boost = Exit precision - Entrance precision. </li></ul></ul></ul><ul><ul><li>Intuition </li></ul></ul><ul><ul><ul><li>Differences take place within the execution of F </li></ul></ul></ul><ul><ul><ul><li>Abnormalities happen while F is on the stack </li></ul></ul></ul><ul><ul><ul><li>The larger this precision boost, the more likely F is part of the backtrace </li></ul></ul></ul><ul><li>Bug-relevant function </li></ul>
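The precision boost defined above is a simple difference, so ranking functions by it can be sketched in a few lines. The struct `FunctionPrecision` and both function names are hypothetical, and the precision values are assumed to have been measured elsewhere by classifiers trained at function entrance and exit.

```c
#include <stddef.h>

/* Classification precision observed when a function is entered and when it exits. */
typedef struct {
    const char *name;
    double entrance_precision;
    double exit_precision;
} FunctionPrecision;

/* Precision boost = exit precision - entrance precision. */
double precision_boost(const FunctionPrecision *f) {
    return f->exit_precision - f->entrance_precision;
}

/* Index of the function with the largest boost, or -1 for an empty array. */
int most_bug_relevant(const FunctionPrecision *fs, size_t n) {
    if (n == 0) {
        return -1;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (precision_boost(&fs[i]) > precision_boost(&fs[best])) {
            best = i;
        }
    }
    return (int)best;
}
```

A function with a large boost is one during whose execution the correct and incorrect runs became easier to tell apart, which is the intuition stated on the slide.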
- 83. Outline <ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Case Study </li></ul><ul><li>Conclusions </li></ul>
- 84. Case Study <ul><li>Subject program </li></ul><ul><ul><li>replace: perform regular expression matching and substitutions </li></ul></ul><ul><ul><li>563 lines of C code </li></ul></ul><ul><ul><li>17 functions are involved </li></ul></ul><ul><li>Execution behaviors </li></ul><ul><ul><li>130 out of 5542 test cases fail to give correct outputs </li></ul></ul><ul><ul><li>No incorrect executions incur segmentation faults </li></ul></ul><ul><li>Logic bug </li></ul><ul><ul><li>Can we identify the backtrace for this bug? </li></ul></ul>void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if (m >= 0){ putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } } void subline(char *lin, char *pat, char *sub) { int i, lastm, m; lastm = -1; i = 0; while((lin[i] != ENDSTR)) { m = amatch(lin, i, pat, 0); if ( (m >= 0) && (lastm != m) ) { putsub(lin, i, m, sub); lastm = m; } if ((m == -1) || (m == i)){ fputc(lin[i], stdout); i = i + 1; } else i = m; } }
- 85. Precision Pairs
- 86. Precision Boost Analysis <ul><li>Objective judgment of bug-relevant functions </li></ul><ul><li>The main function is always bug-relevant </li></ul><ul><li>Stepwise precision boost </li></ul><ul><li>Line-up property </li></ul>
- 87. Backtrace for Noncrashing Bugs
- 88. Method Summary <ul><li>Identify incorrect executions from program runtime behaviors </li></ul><ul><li>Classification dynamics can reveal a “backtrace” for noncrashing bugs without any semantic inputs </li></ul><ul><li>Data mining can contribute to software engineering and systems research in general </li></ul>
- 89. Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
- 90. An Example <ul><li>Replace program: 563 lines of C code, 20 functions </li></ul><ul><li>Symptom: 30 out of 5542 test cases fail to give correct outputs, and no crashes </li></ul><ul><li>Goal: Localizing the bug, and prioritizing manual examination </li></ul><ul><li>void dodash(char delim, char *src, int *i, char *dest, int *j, int maxset) </li></ul><ul><li>{ </li></ul><ul><li>while (…){ </li></ul><ul><li>… </li></ul><ul><li>if(isalnum(isalnum(src[*i+1])) && src[*i-1]<=src[*i+1]){ </li></ul><ul><ul><li>for(k = src[*i-1]+1; k<=src[*i+1]; k++) </li></ul></ul><ul><ul><li>junk = addst(k, dest, j, maxset); </li></ul></ul><ul><ul><li>*i = *i + 1; </li></ul></ul><ul><ul><li>} </li></ul></ul><ul><ul><li>*i = *i + 1; </li></ul></ul><ul><li>} </li></ul><ul><li>} </li></ul>
- 91. Difficulty & Expectation <ul><li>Difficulty </li></ul><ul><ul><li>Statically, even small programs are complex due to dependencies </li></ul></ul><ul><ul><li>Dynamically, execution paths can vary significantly across all possible inputs </li></ul></ul><ul><ul><li>Logic errors have no apparent symptoms </li></ul></ul><ul><li>Expectations </li></ul><ul><ul><li>Unrealistic to fully unload developers </li></ul></ul><ul><ul><li>Localize buggy region </li></ul></ul><ul><ul><li>Prioritize manual examination </li></ul></ul>
- 92. Execution Profiling <ul><li>Full execution trace </li></ul><ul><ul><li>Control flow + value tags </li></ul></ul><ul><ul><li>Too expensive to record at runtime </li></ul></ul><ul><ul><li>Unwieldy to process </li></ul></ul><ul><li>Summarized control flow for conditionals (if, while, for) </li></ul><ul><ul><li>Branch evaluation counts </li></ul></ul><ul><ul><li>Lightweight to take at runtime </li></ul></ul><ul><ul><li>Easy to process and effective </li></ul></ul>
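Branch evaluation counts can be collected with very little code, which is why they are described above as lightweight. The sketch below wraps a condition in a counting helper; `BranchCounter`, `observe`, and the instrumented `count_positive` function are hypothetical examples, not part of any tool named in these slides.

```c
/* Per-conditional profile: how often the condition was true and false in one run. */
typedef struct {
    unsigned long n_true;
    unsigned long n_false;
} BranchCounter;

/* Record one evaluation and return the value unchanged,
 * so the call can wrap an existing condition in place. */
int observe(BranchCounter *c, int value) {
    if (value) {
        c->n_true++;
    } else {
        c->n_false++;
    }
    return value;
}

/* Example of an instrumented function: counts the positive elements of xs. */
int count_positive(const int *xs, int n, BranchCounter *c) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (observe(c, xs[i] > 0)) { /* originally: if (xs[i] > 0) */
            count++;
        }
    }
    return count;
}
```

Only two integers per conditional are stored, regardless of how long the program runs, in contrast to a full execution trace.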
- 93. Analysis of the Example <ul><ul><li>A = isalnum(isalnum(src[*i+1])) </li></ul></ul><ul><ul><li>B = src[*i-1]<=src[*i+1] </li></ul></ul><ul><li>An execution stays logically correct until (A ∧ ¬B) evaluates to true when control reaches this condition </li></ul><ul><li>If we monitor program conditionals such as A, their evaluations shed light on the hidden error and can be exploited for error isolation </li></ul><ul><li>if(isalnum(isalnum(src[*i+1])) && src[*i-1]<=src[*i+1]){ </li></ul><ul><ul><li>for(k = src[*i-1]+1; k<=src[*i+1]; k++) </li></ul></ul><ul><ul><li>junk = addst(k, dest, j, maxset); </li></ul></ul><ul><ul><li>*i = *i + 1; } </li></ul></ul>
- 94. Analysis of Branching Actions <ul><li>Correct vs. incorrect runs in program P </li></ul><ul><li>Across the 5542 test cases, the average probability that A evaluates to true is 0.727 in a correct execution and 0.896 in an incorrect execution </li></ul><ul><li>The error location does exhibit detectable abnormal behavior in incorrect executions </li></ul>[Figure: two 2×2 tables of evaluation counts n_AB, n_A¬B, n_¬AB, n_¬A¬B over A/¬A and B/¬B; in correct runs n_A¬B = 0, in incorrect runs n_A¬B ≥ 1]
- 95. Conditional Test Works for Nonbranching Errors <ul><li>An off-by-one error can still be detected using the conditional tests </li></ul>void makepat (char *arg, int start, char delim, char *pat) { … if (!junk) result = 0; else result = i + 1; /* off-by-one error */ /* should be: result = i */ return result; }
- 96. Ranking Based on Boolean Bias <ul><li>Let input d_i have a desired output o_i. We execute P on d_i and obtain o_i’ = P(d_i). P passes the test iff o_i’ is identical to o_i </li></ul><ul><ul><li>T_p = {t_i | o_i’ = P(d_i) matches o_i} </li></ul></ul><ul><ul><li>T_f = {t_i | o_i’ = P(d_i) does not match o_i} </li></ul></ul><ul><li>Boolean bias: </li></ul><ul><ul><li>n_t: number of times that a boolean feature B evaluates to true; n_f is defined similarly for false </li></ul></ul><ul><ul><li>Boolean bias: π(B) = (n_t – n_f)/(n_t + n_f) </li></ul></ul><ul><ul><li>It encodes the distribution of B’s value: 1 if B is always true, -1 if B is always false, and a value in between for any mixture </li></ul></ul>
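The boolean bias formula above translates directly into code. In this sketch the function name `boolean_bias` is hypothetical, and returning 0 for a predicate that was never evaluated is an assumption of the sketch, since the slide does not say how that case is handled.

```c
/* Boolean bias pi(B) = (n_t - n_f) / (n_t + n_f), a value in [-1, 1].
 * n_true and n_false are the counts of true and false evaluations of B in one run. */
double boolean_bias(unsigned long n_true, unsigned long n_false) {
    unsigned long total = n_true + n_false;
    if (total == 0) {
        return 0.0; /* assumption: an unevaluated predicate is treated as unbiased */
    }
    return ((double)n_true - (double)n_false) / (double)total;
}
```

For example, a predicate that is true 3 times and false once in a run has bias (3 - 1)/(3 + 1) = 0.5.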
- 97. Evaluation Abnormality <ul><li>Boolean bias for branch P </li></ul><ul><ul><li>the probability of being evaluated as true within one execution </li></ul></ul><ul><li>Suppose we have n correct and m incorrect executions, for any predicate P , we end up with </li></ul><ul><ul><li>An observation sequence for correct runs </li></ul></ul><ul><ul><ul><li>S_p = (X’_1, X’_2, …, X’_n) </li></ul></ul></ul><ul><ul><li>An observation sequence for incorrect runs </li></ul></ul><ul><ul><ul><li>S_f = (X_1, X_2, …, X_m) </li></ul></ul></ul><ul><li>Can we infer whether P is suspicious based on S_p and S_f ? </li></ul>
- 98. Underlying Populations <ul><li>Imagine that the underlying distributions of boolean bias for correct and incorrect executions are f(X|θ_p) and f(X|θ_f) </li></ul><ul><li>S_p and S_f can be viewed as random samples from the respective underlying populations </li></ul><ul><li>Major heuristic: The larger the divergence between f(X|θ_p) and f(X|θ_f), the more relevant the branch P is to the bug </li></ul>[Figure: two probability density plots over evaluation bias on [0, 1]]
- 99. Major Challenges <ul><li>No knowledge of the closed forms of both distributions </li></ul><ul><li>Usually, we do not have sufficient incorrect executions to estimate f(X|θ_f) reliably. </li></ul>[Figure: two probability density plots over evaluation bias on [0, 1]]
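One generic way around the second challenge is to estimate the mean and variance only from the plentiful correct runs, then ask how surprising the sample mean of the incorrect runs would be if both samples came from the same population. The sketch below computes a squared z-statistic under that null hypothesis. It is an illustrative assumption, not necessarily the statistic used in these slides, and `sample_mean` and `squared_z_score` are hypothetical names.

```c
#include <stddef.h>

/* Arithmetic mean of n values; the caller guarantees n > 0. */
static double sample_mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Squared z-statistic of the incorrect-run mean against the correct-run population:
 *   Z^2 = m * (mean(S_f) - mu_p)^2 / var_p
 * where mu_p and var_p are the sample mean and (n-1) sample variance of S_p.
 * Larger values suggest a larger divergence for this predicate.
 * Returns -1.0 when the statistic is undefined (n < 2, m < 1, or zero variance). */
double squared_z_score(const double *s_p, size_t n, const double *s_f, size_t m) {
    if (n < 2 || m < 1) {
        return -1.0;
    }
    double mu_p = sample_mean(s_p, n);
    double var_p = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = s_p[i] - mu_p;
        var_p += d * d;
    }
    var_p /= (double)(n - 1);
    if (var_p <= 0.0) {
        return -1.0;
    }
    double diff = sample_mean(s_f, m) - mu_p;
    return (double)m * diff * diff / var_p;
}
```

The design choice here is that nothing about f(X|θ_f) is estimated beyond the sample mean, so a handful of failing runs is enough to produce a score.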
- 100. Our Approach: Hypothesis Testing
- 101. Faulty Functions <ul><li>Motivation </li></ul><ul><ul><li>Bugs are not necessarily on branches </li></ul></ul><ul><ul><li>Higher confidence in function rankings than branch rankings </li></ul></ul><ul><li>Abnormality score for functions </li></ul><ul><ul><li>Calculate the abnormality score for each branch within each function </li></ul></ul><ul><ul><li>Aggregate them </li></ul></ul>
- 102. Two Evaluation Measures <ul><li>CombineRank </li></ul><ul><ul><li>Combine these scores by summation </li></ul></ul><ul><ul><li>Intuition: When a function contains many abnormal branches, it is likely bug-relevant </li></ul></ul><ul><li>UpperRank </li></ul><ul><ul><li>Choose the largest score as the representative </li></ul></ul><ul><ul><li>Intuition: When a function has one extremely abnormal branch, it is likely bug-relevant </li></ul></ul>
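The two measures differ only in how per-branch abnormality scores are aggregated within a function. A minimal sketch, assuming the branch scores have already been computed, is shown below; the function names `combine_rank_score` and `upper_rank_score` are hypothetical.

```c
#include <stddef.h>

/* CombineRank: the function's score is the sum of its branch scores. */
double combine_rank_score(const double *branch_scores, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += branch_scores[i];
    }
    return sum;
}

/* UpperRank: the function's score is its largest branch score.
 * Returns 0.0 for a function with no branches (an assumption of this sketch). */
double upper_rank_score(const double *branch_scores, size_t n) {
    if (n == 0) {
        return 0.0;
    }
    double best = branch_scores[0];
    for (size_t i = 1; i < n; i++) {
        if (branch_scores[i] > best) {
            best = branch_scores[i];
        }
    }
    return best;
}
```

The measures can disagree: a function with branch scores {0.25, 0.25, 3.0} beats one with {1.0, 1.0, 1.0, 1.0} under UpperRank (3.0 vs. 1.0) but loses under CombineRank (3.5 vs. 4.0).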
- 103. Dodash vs. Omatch: Which Function Is Likely Buggy, and Which Measure Is More Effective?
- 104. Bug Benchmark <ul><li>Bug benchmark </li></ul><ul><ul><li>Siemens Program Suite </li></ul></ul><ul><ul><ul><li>89 variants of 6 subject programs, each of 200-600 LOC </li></ul></ul></ul><ul><ul><ul><li>89 known bugs in total </li></ul></ul></ul><ul><ul><ul><li>Mainly logic (or semantic) bugs </li></ul></ul></ul><ul><ul><li>Widely used in software engineering research </li></ul></ul>
- 105. Results on Program “replace”
- 106. Comparison between CombineRank and UpperRank <ul><li>Buggy function ranked within top-k </li></ul>
- 107. Results on Other Programs
- 108. More Questions to Be Answered <ul><li>What happens, and how should it be handled, if multiple errors exist in one program? </li></ul><ul><li>How to detect bugs if only very few failing test cases are available? </li></ul><ul><li>Is it really more effective if we have more execution traces? </li></ul><ul><li>How to integrate program semantics in this statistics-based testing algorithm? </li></ul><ul><li>How to integrate program semantics analysis with statistics-based analysis? </li></ul>

