Here is the outline. We start with a general discussion of automated debugging and failure triage, and then discuss our approaches to these two problems separately. After that, I discuss the work I propose to finish before graduation, and finally I discuss future research directions and draw conclusions.
This work is about how to localize software bugs automatically. The major motivation is that software is full of bugs. One study showed that the average error rate is 1 to 4.5 errors per 1,000 lines of code. For example, Windows 2000, which has 35 million lines of code, contained 63,000 known bugs at the time of its release, which is about 2 errors per thousand lines. When bugs strike in practice, the costs are tremendous. In 1996, the Ariane 5 exploded 40 seconds after launch. According to the investigation, the explosion was due to errors in the software of the inertial reference system. A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually. Therefore, a great deal of effort goes into testing and debugging during the software life cycle. Bill Gates once said that 50% of his company's employees are testers, and the rest spend 50% of their time testing. As we all know, testing and debugging are tough tasks, so there has been research on bug localization.
Software bugs cause program failures, and developers need access to program failures in order to debug. Recent years have seen the rise of a software practice known as automated program failure reporting. I believe you have seen windows like this. Whenever …, a window pops up asking whether you’d like to send … to a central server for the software vendor to diagnose. Because failure reports contain valuable information about program failures in the field, they prove very useful in helping software vendors improve their software. Because of this usefulness, automated failure reporting has been widely adopted, for example in Linux, Mozilla applications, and Microsoft Windows. In fact, by using third-party libraries or Windows APIs, any application can implement its own failure reporting functionality.
But after failures are collected, we cannot randomly pick a failure and randomly ask a developer to diagnose it. Instead, we need to identify the most severe failures and assign them to the appropriate developers, which is called failure triage. Specifically, two tasks need to be handled: first, failure prioritization, and second, failure assignment. The severity of a bug is determined by the number of failures it causes. Since the most frequently reported failures are usually the most severe, we need to identify failures that are likely due to the same bug. Failure assignment, on the other hand, means finding the appropriate developers to diagnose a given set of failures.
Although semantic bugs look tricky, they are by no means rare. According to a recent study of bug characteristics, semantic bugs actually dominate, accounting for about 78% of all bugs. In contrast, memory bugs account for only 16%. [Why] This is because in recent years many memory checking tools have been developed, and they are actually used in practice. Because semantic bugs have become dominant, we need to pay more attention to them.
Unfortunately, hacking semantic bugs is hard because of the absence of crashes. With no crashes, we have no failure signatures, so failure triage becomes elusive. Also, without crashes, developers have few hints about where the bug could be. Our contributions are that 1) we developed a statistical debugging algorithm that can automatically localize semantic bugs without any prior knowledge of program semantics, and 2) we found that statistical debugging can be used to triage noncrashing failures.
Let us use an example to explain statistical debugging. Here is a function in a buggy program. Suppose we know this is the buggy function; where is the bug? It turns out that the bug is here. There should have been a conjunction of two subclauses. However, the developer forgot one subclause, and as a consequence 130 out of 5542 test cases fail to give the correct output, and none of the failures is a crash. Conventionally, for such semantic bugs, software developers need to find a failing execution and trace it step by step. Our tool SOBER, however, can pinpoint the real bug location immediately, so that developers can examine this part first, set a breakpoint here, and check for abnormality. How does our tool work? We first instrument the source code with predicates. In general, predicates can be about any program property. In particular, we instrument two kinds of predicates. The first kind is the boolean predicate: for every boolean expression, we instrument a predicate that the boolean value is true. The second kind is about function calls: for every function call, we instrument three predicates, namely that the return value is less than, equal to, and greater than 0. [like tossing a coin] For any predicate, every time its associated source code is executed, the predicate is evaluated, and every evaluation is either true or false, just like tossing a coin. For example, if this while loop condition is evaluated 6 times, 5 times true and 1 time false, the numbers of true and false evaluations are recorded, and they form the predicate profile of the execution.
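To make the instrumentation idea concrete, here is a minimal sketch in C of how true/false counts could be recorded. The names PredicateProfile, ReturnProfile, observe, observe_return, and profile_loop are hypothetical and chosen only for illustration; this is not SOBER's actual instrumentation code. The toy loop mimics the shape of the while loop in subline() and uses '\0' in place of ENDSTR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-predicate counters (illustrative names, not SOBER's). */
typedef struct {
    unsigned n_true;   /* how many evaluations were true  */
    unsigned n_false;  /* how many evaluations were false */
} PredicateProfile;

/* Record one evaluation and pass the value through, so the call can
   wrap a condition without changing the program's control flow. */
static bool observe(PredicateProfile *p, bool value) {
    assert(p != NULL);
    if (value) {
        p->n_true++;
    } else {
        p->n_false++;
    }
    return value;
}

/* The three return-value predicates for one call site: < 0, == 0, > 0. */
typedef struct {
    PredicateProfile negative, zero, positive;
} ReturnProfile;

static int observe_return(ReturnProfile *r, int ret) {
    observe(&r->negative, ret < 0);
    observe(&r->zero, ret == 0);
    observe(&r->positive, ret > 0);
    return ret;
}

/* Toy loop shaped like the one in subline(): walk a string to its end
   and return the profile of the loop condition. */
static PredicateProfile profile_loop(const char *lin) {
    PredicateProfile loop = {0, 0};
    int i = 0;
    while (observe(&loop, lin[i] != '\0')) {
        i = i + 1;
    }
    return loop;
}
```

Calling profile_loop on a five-character string evaluates the loop condition six times, five times true and once false, which matches the coin-tossing example above.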
Then we can concatenate the profiles of all predicates and get a vector representation of each execution. Suppose we have three executions of the function, two passing and one failing; then we may get a representation like this. Intuitively, if a predicate is always false in passing executions and happens to be true in all failing executions, then this predicate is very likely related to program failures. Our tool generalizes this intuition and tries to find predicates whose true probability diverges between passing and failing executions.
In order to identify the divergence in head probability, we define a random variable that estimates the head probability from each execution, and we call it the evaluation bias. Specifically, the evaluation bias is the fraction of true evaluations of the predicate in one execution. We treat each predicate independently.
Therefore, multiple evaluation biases are observed from multiple executions, and these evaluation biases can be treated as generated from underlying models for passing and failing executions. We then want to quantify the divergence between the two models: the larger the divergence, the more likely the corresponding predicate is fault relevant. The rationale is that this predicate shows the largest divergence.
However, there are two major challenges in quantifying the model divergence directly. First, we have no idea of the closed form of either model. Second, even though it is possible … So it is hard to quantify the divergence directly, and we consider an indirect approach.
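As a small illustration of the definition, the evaluation bias of one predicate in one execution can be computed from its true and false counts. This is only a sketch: the function name evaluation_bias is hypothetical, and returning -1.0 for a predicate that was never evaluated is an assumption made here, not something stated in the talk.

```c
#include <assert.h>

/* Evaluation bias: the fraction of true evaluations of one predicate
   in one execution. Returns -1.0 when the predicate was never
   evaluated (an illustrative sentinel, assumed for this sketch). */
static double evaluation_bias(unsigned n_true, unsigned n_false) {
    unsigned total = n_true + n_false;
    if (total == 0) {
        return -1.0;
    }
    double bias = (double)n_true / (double)total;
    assert(bias >= 0.0 && bias <= 1.0);
    return bias;
}
```

For the loop condition that is true 5 times and false once, the evaluation bias is 5/6.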
We proposed a hypothesis testing-based indirect approach to quantifying the model divergence. In order to …, we first …, and then we derive a test statistic, which follows a normal distribution according to the central limit theorem. Intuitively, the value corresponds to the likelihood of observing the evaluation biases from failing executions as if they were generated from the model of passing executions. So the smaller the value, the more likely the null hypothesis is false, the larger the divergence between the two models, and hence the more fault relevant the predicate P.
Here is a summary of SOBER. Given the source code, we first instrument it with program predicates. Then we run a test suite and collect the predicate profiles of both passing and failing executions. This step needs a test oracle to separate passing executions from failing ones. SOBER then takes in the predicate profiles and generates a ranked list of all instrumented predicates, and the top predicates point to the blamed bug location.
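The flavor of such a test can be sketched as follows, under stated assumptions. Take the mean and variance of the evaluation biases in passing executions as the null model, and compare the mean bias Y over m failing executions against it with the usual central-limit statistic Z = (Y - mu_p) / (sigma_p / sqrt(m)). The sketch returns Z squared so that no square root is needed. The function names are hypothetical, and the exact score that SOBER derives from the statistic is in the paper, not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n values. */
static double mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Sample variance with the (n - 1) denominator. */
static double variance(const double *x, size_t n) {
    double mu = mean(x, n);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (x[i] - mu) * (x[i] - mu);
    }
    return sum / (double)(n - 1);
}

/* Z^2, where Z = (Y - mu_p) / (sigma_p / sqrt(m)):
   pass holds n evaluation biases from passing runs,
   fail holds m evaluation biases from failing runs.
   A larger value means the failing runs look less like the passing model. */
static double z_squared(const double *pass, size_t n,
                        const double *fail, size_t m) {
    assert(n >= 2 && m >= 1);
    double mu_p = mean(pass, n);
    double var_p = variance(pass, n);
    assert(var_p > 0.0);  /* a constant passing profile needs special handling */
    double y = mean(fail, m);
    return (double)m * (y - mu_p) * (y - mu_p) / var_p;
}
```

A predicate whose failing-run biases sit far from the passing-run mean gets a large value, and one whose failing runs look like passing runs gets a value near zero.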
The idea of statistical debugging is not new. Previously, Ben Liblit proposed a statistical debugging algorithm based on correlation analysis. Specifically, for each predicate P, they estimate the probability that the program fails if P is ever evaluated, which they call Context(P), and they also estimate the probability that the program fails if P is ever evaluated as true, which they call Failure(P). Finally, they take the difference, Increase(P) = Failure(P) - Context(P), as the fault relevance score of the predicate. Intuitively, they estimate how much more likely the program is to fail when a predicate is ever evaluated as true.
Now we use a simple example to illustrate their algorithm. The discrimination of Liblit05 relies on the proportion of executions where P is evaluated only as false. A larger value implies higher fault relevance.
In comparison, SOBER adopts a fundamentally different approach. We learn a model … and do the contrast. As a direct approach is infeasible, we proposed an indirect approach to quantifying the divergence between the two models.
Besides the difference in methodology, SOBER also differs from Liblit05 in the information it uses. Line 8 likely executes. SOBER can solve it because X_f = 0.9024 and X_p = 0.2952.
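The three Liblit05 quantities can be sketched directly from run counts. The struct and function names below are hypothetical; only the definitions of Context, Failure, and Increase come from the slides.

```c
#include <assert.h>

/* Run counts for one predicate P (illustrative field names). */
typedef struct {
    unsigned fail_observed;  /* failing runs in which P was evaluated at all */
    unsigned pass_observed;  /* passing runs in which P was evaluated at all */
    unsigned fail_true;      /* failing runs in which P was true at least once */
    unsigned pass_true;      /* passing runs in which P was true at least once */
} LiblitCounts;

/* Context(P) = Prob(fail | P ever evaluated) */
static double context_score(const LiblitCounts *c) {
    unsigned observed = c->fail_observed + c->pass_observed;
    assert(observed > 0);
    return (double)c->fail_observed / (double)observed;
}

/* Failure(P) = Prob(fail | P ever evaluated as true) */
static double failure_score(const LiblitCounts *c) {
    unsigned ever_true = c->fail_true + c->pass_true;
    assert(ever_true > 0);
    return (double)c->fail_true / (double)ever_true;
}

/* Increase(P) = Failure(P) - Context(P) */
static double increase_score(const LiblitCounts *c) {
    return failure_score(c) - context_score(c);
}
```

With the numbers from the Liblit05 illustration slide (4 failing runs out of 10 that evaluate P, and 3 failing runs out of 7 in which P is ever true), this gives Context = 2/5, Failure = 3/7, and Increase = 1/35.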
Now we need to evaluate the debugging quality of SOBER, and we use the T-score as the quality metric. Intuitively, the T-score quantifies how close the blamed bug location is to the real bug location. We again use this example to explain how the T-score is computed and why it is a reasonable measure. First, represent the program as a program dependence graph (PDG), where each node represents a piece of code and the edges represent data and control dependences. Second, mark the real bug locations. Third, mark the blamed locations. Fourth, do a breadth-first search from the blamed locations until reaching a real bug location. The T-score is the percentage of the PDG covered by the search. Intuitively, it estimates the percentage of code that a developer needs to examine before locating the bug if he or she examines the code along dependences. Why is it a good measure? It is objective, and it reflects the manual effort needed. These are the two reasons why it is widely used.
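The steps above can be sketched as a layer-by-layer breadth-first search. This is a minimal sketch under assumptions made here: the PDG is a small adjacency matrix, dependence edges are followed in both directions, a whole layer is counted once any node in it is reached, and an unreachable bug is scored as 1.0. The name t_score and the MAX_NODES bound are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 64

/* T-score sketch: fraction of PDG nodes covered by a breadth-first
   search that starts at the blamed nodes and stops at the first layer
   containing a real bug node. adj[u][v] marks a dependence edge,
   followed in both directions (an assumption of this sketch). */
static double t_score(int n, bool adj[MAX_NODES][MAX_NODES],
                      const bool blamed[MAX_NODES],
                      const bool buggy[MAX_NODES]) {
    assert(n > 0 && n <= MAX_NODES);
    bool covered[MAX_NODES] = {false};
    int count = 0;
    bool found = false;

    /* Layer 0: the blamed locations themselves. */
    for (int v = 0; v < n; v++) {
        if (blamed[v]) {
            covered[v] = true;
            count++;
            if (buggy[v]) found = true;
        }
    }

    while (!found) {
        bool next[MAX_NODES] = {false};
        int added = 0;
        /* Collect every uncovered neighbor of the covered set. */
        for (int u = 0; u < n; u++) {
            if (!covered[u]) continue;
            for (int v = 0; v < n; v++) {
                if ((adj[u][v] || adj[v][u]) && !covered[v] && !next[v]) {
                    next[v] = true;
                    added++;
                }
            }
        }
        if (added == 0) break;  /* bug not reachable from blamed nodes */
        for (int v = 0; v < n; v++) {
            if (next[v]) {
                covered[v] = true;
                count++;
                if (buggy[v]) found = true;
            }
        }
    }

    if (!found) return 1.0;  /* assumed convention: everything examined */
    return (double)count / (double)n;
}
```

On a five-node chain with the blamed location at one end and the bug in the middle, the search covers three nodes, so the score is 60%.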
So now you understand how to use the T-score to quantify a debugging result. Here is the comparison between SOBER, Liblit05, and another debugging algorithm, CT, which claimed the best result on the Siemens suite before SOBER. The x-axis shows the T-score, and the y-axis shows the percentage of the 130 bugs that can be located with no more than that percentage of code examined. For example, at a T-score of 10%, SOBER identifies 52% of the 130 bugs with no more than 10% of the code examined, whereas Liblit05 identifies 40%. When a developer is OK with examining no more than 20% of the code, SOBER identifies 73% while Liblit05 identifies about 63%. We note that a T-score of more than 20% is generally meaningless in the sense that …
Because the Siemens program suite mainly contains small programs, we also did a set of case studies on reasonably large programs: flex, grep, and gzip. The subject programs, bugs, and test suites are all obtained from the SIR. The number of lines of code is shown here, and the T-score from SOBER is listed in the last column.
Let’s take a quick look at those bugs, and get a sense of how tricky semantic bugs could be.
In the following, let’s examine two cases where SOBER performs well, and one case where SOBER performs badly.
This is the first bug in grep-2.2, which has 11,826 LOC. There are 3,136 predicates instrumented. The bug is at line 553, where the plus 1 shouldn’t be there, and this bug causes 48 out of 470 cases to fail. After running SOBER, we get a predicate ranking whose top predicates are P1470 and P1484, which point to …. With the two points identified, we then check where the variables beg and lastout are assigned, and it turns out that the buggy line is the only place where the variable beg is assigned. So with our debugging tool SOBER, developers can easily identify the bug. Otherwise, they would need to hunt for the bug in more than 11K lines of code.
For the second bug in grep, the top-ranked predicate identified by SOBER is P1952, and it pinpoints the bug location.
Certainly, SOBER is not a silver bullet, and it cannot guarantee effectiveness for all semantic bugs, because semantic bugs can be very tricky. The 5th bug in Flex is an example. The constant should have been assigned to chk[offset] rather than chk[offset – 1]. But since the wrong value is overwritten with the correct value, there is no wrong value in chk[offset – 1]. In addition, chk[offset] is not used here but later, so this is a very tricky bug, and even human beings may find it hard to debug.
How to capture value abnormality is future work. Subclause-missing bugs are relatively easy to locate, which can be justified by reasoning; details are in our paper. Off-by-one bugs can be easy or hard to detect, depending on how significantly the off-by-one value affects the control flow, because SOBER currently monitors mainly boolean predicates, which roughly correspond to control flow. Bug 1 of Grep-2.2 is a good case because the wrong value is immediately used for flow control. Bug 5 of Flex is a bad case because the wrong value is never used for control flow and only causes value errors. The remaining 4 bugs are of various kinds, and although they look wild, they can be detected because of control flow anomalies. In conclusion, SOBER is more sensitive to abnormal program flows than to abnormal values. This could partially be because our current predicates do not capture value abnormality. How to capture value abnormality through predicates is an interesting question.
While we have been demonstrating the effectiveness of SOBER for semantic bugs, this does not mean SOBER can do nothing with memory bugs. On the contrary, SOBER is also effective for memory bugs. In fact, we used SOBER to identify two memory bugs in bc 1.06, an arbitrary-precision calculator, and one of them had not been previously reported. One important point is that the location blamed by SOBER is NOT the crashing venue, but is very close to the root cause. The first bug is the unreported one. The top predicate is as circled. Because old_count is copied from v_count, by putting a watch on v_count, we notice that it is unexpectedly overwritten. The second bug is the widely reported copy-paste bug in bc. The top predicate is a_count < v_count, and the v_count shown should be a_count.
As a reminder, there are two major problems in failure triage.
One solution to all three problems is to CLUSTER failures, such that failures likely due to the same fault are clustered together. In particular, it would be better if we could visualize the clustering so that developers can intuitively analyze the relationships between failures. Now, let’s suppose we can. <click to show the plus signs> In this graph, you do not need to worry about what the x- and y-axes mean. What you need to know is that each cross represents a failure, and a small distance between two crosses means that the corresponding two failures are likely due to the same fault. This graph can help us tackle the three problems. First, we can visually identify the failure clusters. <click to draw the two circles> Second, based on the cluster size, we can prioritize failure diagnosis. For example, the upper cluster likely represents the most severe fault and should be diagnosed first, followed by the lower cluster of failures. Finally, if each cluster automatically provides the likely fault location, like <click to show callouts>, failure clusters can be assigned to the appropriate developers automatically. In this talk, we will discuss how we can obtain a graph like this, which we call the failure proximity graph, and how to automatically find the fault location for each cluster.
As you may have noticed, the central question of failure clustering is how to define a distance between failures, such that … Different distance definitions result in different failure clusterings. So our objective is to find a distance measure such that failures due to different faults are well separated.
So how to define a distance? Previously, Podgurski proposed T-Proximity, which defines distances between failures based on the literal trace similarity. However, since not the entire trace is failure relevant, T-Proximity is not good at identifying failures due to the same bug. Therefore, we proposed another approach, called R-Proximity, which uses SOBER to find the likely bug location for every failure, and defines distances between failures based on the likely bug location. In fact,
Put failures with the same root cause (RC) together. Manual work to find root causes is inevitable. In fact, what we want is to avoid manual identification of the root cause for every failure. This actually provides a general framework for failure triage. Explicitly, if we can find the root cause of each failure, then the clustering based on the root causes would be optimal. For example, given a set of failing executions, fail1, fail2 to failm, and a set of passing executions, pass1, pass2 to passn, a developer manually investigates each failure and finds its root cause. Then, if RC2 and RC3 are the same, failure2 and failure3 are clustered together. Similarly, if RC1 and RCm are the same, failure1 and failurem are clustered together. As we can see, since the root cause can only be found through manual work, the optimal failure proximity is too expensive to obtain.
Ranking is an expression of preferences. Ranking distance measures the preference disagreement. Explain what predicates are instrumented in this study.
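One common way to measure disagreement between two rankings is to count the pairs that the rankings order differently (the Kendall tau distance). The notes do not say which ranking distance is used, so the sketch below is an assumption made for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Kendall tau distance between two rankings of the same n predicates.
   rank_a[i] is the position of predicate i in ranking A (0 = top);
   rank_b[i] is its position in ranking B. The result is the number of
   predicate pairs that the two rankings order in opposite ways. */
static unsigned kendall_tau_distance(const int *rank_a, const int *rank_b,
                                     size_t n) {
    assert(rank_a != NULL && rank_b != NULL);
    unsigned discordant = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int da = rank_a[i] - rank_a[j];
            int db = rank_b[i] - rank_b[j];
            if ((da < 0 && db > 0) || (da > 0 && db < 0)) {
                discordant++;
            }
        }
    }
    return discordant;
}
```

Two failures whose predicate rankings agree get distance 0, and a fully reversed ranking of three predicates gets distance 3, so failures blamed on the same predicates end up close together.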
Our first case study is with Grep-2.2.
For both R-Proximity and T-Proximity, we calculate the pairwise distances between failures and then present these failures in a 2-D space such that the pairwise distances are best preserved.
No matter what you circle, the predicates 1470 and 1484 always dominate
Most of these problems are noncrashing failures.
Now let’s take a look at how semantic bugs that incur no crashes can be located through statistical analysis.
From a memory-access point of view, even incorrect executions are correct.
Recall that for crashing bugs, memory accesses are obviously where the abnormality happens, so the call stack constitutes the backtrace. If we knew where the abnormality happens, the call stack could also serve as the backtrace for noncrashing bugs.
Usually, this is a finite state machine.
When do these methods not work?
From a memory-access point of view, even incorrect executions are correct, and hence hard to model using finite state machines.
Behavior graph = call graph + transition graph. One graph from one execution.
The main idea of incremental classification is that we train classifiers at different stages of program execution so that we have a chance to capture when the bug happens or where the abnormality is. Basically, an incorrect execution looks the same as a correct one at the beginning, and then at a certain stage the execution triggers the bug and diverges from a correct execution. So if we can
CP-Miner [LLM+04] detects copy-paste bugs in OS code and uses the CloSpan algorithm. C-Miner [LCS+04] discovers block correlations in storage systems, again uses the CloSpan algorithm, and effectively reduces I/O response time. … …
Had the function been written correctly, the subclause in red should have been there.
How to represent
If we knew them, some standard measures could apply, e.g., KL-divergence.
With some derivation shown in the paper.
Here comes the outline. We first discuss an example, which illustrates why logic errors are hard to deal with.
Transcript of "5/4/10 Data Mining: Principles and Algorithms "
3.
Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
4.
Software Bugs Are Costly <ul><li>Software is “full of bugs” </li></ul><ul><ul><li>Windows 2000, 35 million lines of code </li></ul></ul><ul><ul><ul><li>63,000 known bugs at the time of release, 2 per 1000 lines </li></ul></ul></ul><ul><li>Software failure costs </li></ul><ul><ul><li>Ariane 5 explosion due to “errors in the software of the inertial reference system” (Ariane-5 flight 501 inquiry board report http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf ) </li></ul></ul><ul><ul><li>A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually http://www.nist.gov/director/prog-ofc/report02-3.pdf </li></ul></ul><ul><li>Testing and debugging are laborious and expensive </li></ul><ul><ul><li>“50% of my company employees are testers, and the rest spends 50% of their time testing!” (Bill Gates, in 1995) </li></ul></ul>
5.
Automated Failure Reporting <ul><li>End-users as Beta testers </li></ul><ul><ul><li>Valuable information about failure occurrences in reality </li></ul></ul><ul><ul><li>24.5 million/day in Redmond (if all users send) – John Dvorak, PC Magazine </li></ul></ul><ul><li>Widely adopted because of its usefulness </li></ul><ul><ul><li>Microsoft Windows, Linux Gentoo, Mozilla applications … </li></ul></ul><ul><ul><li>Any applications can implement this functionality </li></ul></ul>
6.
After Failures Collected …: Failure triage <ul><li>Failure triage </li></ul><ul><ul><li>Failure prioritization: </li></ul></ul><ul><ul><ul><li>What are the most severe bugs? </li></ul></ul></ul><ul><ul><li>Failure assignment: </li></ul></ul><ul><ul><ul><li>Which developers should debug a given set of failures? </li></ul></ul></ul><ul><li>Automated debugging </li></ul><ul><ul><li>Where is the likely bug location? </li></ul></ul>
7.
A Glimpse on Software Bugs <ul><li>Crashing bugs </li></ul><ul><ul><li>Symptoms: segmentation faults </li></ul></ul><ul><ul><li>Reasons: memory access violations </li></ul></ul><ul><ul><li>Tools: Valgrind, CCured </li></ul></ul><ul><li>Noncrashing bugs </li></ul><ul><ul><li>Symptoms: unexpected outputs </li></ul></ul><ul><ul><li>Reasons: logic or semantic errors </li></ul></ul><ul><ul><ul><li>if ((m >= 0)) vs. if ((m >= 0) && (m != lastm )) </li></ul></ul></ul><ul><ul><ul><li>< vs. <=, > vs. >=, etc .. </li></ul></ul></ul><ul><ul><ul><li>j = i vs. j= i+1 </li></ul></ul></ul><ul><ul><li>Tools: No sound tools </li></ul></ul>
8.
Semantic Bugs Dominate <ul><li>Semantic Bugs : </li></ul><ul><li>Application specific </li></ul><ul><li>Only few detectable </li></ul><ul><li>Mostly require annotations or specifications </li></ul><ul><li>Memory-related Bugs: </li></ul><ul><li>Many are detectable </li></ul>Others Concurrency bugs <ul><li>Bug Distribution [Li et al., ICSE’07] </li></ul><ul><li>264 bugs in Mozilla and 98 bugs in Apache manually checked </li></ul><ul><li>29,000 bugs in Bugzilla automatically checked </li></ul>Courtesy of Zhenmin Li
9.
Hacking Semantic Bugs is HARD <ul><li>Major challenge: No crashes! </li></ul><ul><ul><li>No failure signatures </li></ul></ul><ul><ul><li>No debugging hints </li></ul></ul><ul><li>Major Methods </li></ul><ul><ul><li>Statistical debugging of semantic bugs [Liu et al., FSE’05, TSE’06] </li></ul></ul><ul><ul><li>Triage noncrashing failures through statistical debugging [Liu et al., FSE’06] </li></ul></ul>
10.
Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
11.
A Running Example
Buggy version:
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}
Correct version (with the missing subclause):
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0) && (lastm != m)) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}
<ul><li>130 of 5542 test cases fail, no crashes </li></ul>
Predicate evaluation as tossing a coin
[Table: # of true and # of false evaluations per predicate. Predicates: (lin[i] != ENDSTR)==true, Ret_amatch > 0, Ret_amatch == 0, Ret_amatch < 0, (m >= 0) == true, (m == i) == true, (m >= -1) == true. Count pairs as extracted, row order not recoverable: 5 1, 5 1, 1 5, 1 5, 4 2, 2 4, 1 5]
12.
Profile Executions as Vectors <ul><li>Extreme case </li></ul><ul><ul><li>Always false in passing and always true in failing … </li></ul></ul><ul><li>Generalized case </li></ul><ul><ul><li>Different true probability in passing and failing executions </li></ul></ul>
[Figure: profile vectors of two passing executions and one failing execution, as extracted]
5 1 4 2 5 1 2 4 5 1 5 1 1 5
19 1 18 2 19 1 2 18 19 1 19 1 1 19
9 1 8 2 2 8 2 8 9 1 9 1 1 9
13.
Estimated Head Probability <ul><li>Evaluation bias </li></ul><ul><ul><li>Estimated head probability from every execution </li></ul></ul><ul><ul><li>Specifically, $\pi(P) = \frac{n_t}{n_t + n_f}$, </li></ul></ul><ul><ul><li>where $n_t$ and $n_f$ are the number of true and false evaluations in one execution. </li></ul></ul><ul><ul><li>Defined for each predicate and each execution </li></ul></ul>
14.
Divergence in Head Probability <ul><li>Multiple evaluation biases from multiple executions </li></ul><ul><li>Evaluation bias as generated from models </li></ul>[Figure: two plots of probability (Prob) against head probability, each on the range 0 to 1]
15.
Major Challenges <ul><li>No closed form of either model </li></ul><ul><li>No sufficient number of failing executions to estimate </li></ul>[Figure: two plots of probability (Prob) against head probability, each on the range 0 to 1]
17.
SOBER in Summary [Diagram: Source Code and Test Suite feed into SOBER, which outputs a ranked list of predicates: Pred 2, Pred 6, Pred 1, Pred 3]
18.
Previous State of the Art [Liblit et al, 2005] <ul><li>Correlation analysis </li></ul><ul><ul><li>Context( P ) = Prob(fail | P ever evaluated) </li></ul></ul><ul><ul><li>Failure( P ) = Prob(fail | P ever evaluated as true ) </li></ul></ul><ul><ul><li>Increase( P ) = Failure( P ) – Context( P ) </li></ul></ul><ul><li>How more likely the program fails when a predicate is ever evaluated true </li></ul>
19.
Liblit05 in Illustration [Figure: failing and passing executions drawn as + and O marks]
Context(P) = Prob(fail | P ever evaluated) = 4/10 = 2/5
Failure(P) = Prob(fail | P ever evaluated as true) = 3/7
Increase(P) = Failure(P) – Context(P) = 3/7 – 2/5 = 1/35
20.
SOBER in Illustration [Figure: failing and passing executions drawn as + and O marks, with two plots of probability (Prob) against evaluation bias, each on the range 0 to 1]
21.
Difference between SOBER and Liblit05 <ul><li>Methodology: </li></ul><ul><ul><li>Liblit05: Correlation analysis </li></ul></ul><ul><ul><li>SOBER: Model-based approach </li></ul></ul><ul><li>void subline(char *lin, char *pat, char *sub) </li></ul><ul><li>{ </li></ul><ul><li>1 int i, lastm, m; </li></ul><ul><li>2 lastm = -1; </li></ul><ul><li>3 i = 0; </li></ul><ul><li>4 while((lin[i] != ENDSTR)) { </li></ul><ul><li>5 m = amatch(lin, i, pat, 0); </li></ul><ul><li>6 if (m >= 0){ </li></ul><ul><li>7 putsub(lin, i, m, sub); </li></ul><ul><li>8 lastm = m; </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>11 } </li></ul><ul><li>Utilized information </li></ul><ul><ul><li>Liblit05: Ever true? </li></ul></ul><ul><ul><li>SOBER: What percentage is true? </li></ul></ul><ul><li>Liblit05: </li></ul><ul><li>Line 6 is ever true in most passing and failing exec. </li></ul><ul><li>SOBER: </li></ul><ul><li>Prone to be true in failing exec. </li></ul><ul><li>Prone to be false in passing exec. </li></ul>
22.
T-Score: Metric of Debugging Quality <ul><li>How close is the blamed to the real bug location? </li></ul>
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0)) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}
T-score = 70%
23.
A Better Debugging Result
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if ((m >= 0)) {
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            i = i + 1;
        } else
            i = m;
    }
}
T-score = 40%
24.
Evaluation 1: Siemens Program Suite <ul><li>T-Score <= 20% is meaningful </li></ul><ul><li>Siemens program suite </li></ul><ul><ul><li>130 buggy versions of 7 small (<700LOC) programs </li></ul></ul><ul><li>What percentage of bugs can be located with no more than a given percentage of code examination </li></ul>
25.
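Evaluation 2: Reasonably Large Programs
<table>
<tr><th>Program</th><th>Bug</th><th>Bug Type</th><th>Failure Number</th><th>T-Score</th></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 1</td><td>Misuse >= for ></td><td>163/525</td><td>0.5%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 2</td><td>Misuse of = for ==</td><td>356/525</td><td>1.6%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 3</td><td>Mis-assign value true for false</td><td>69/525</td><td>7.6%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 4</td><td>Mis-parenthesize ((a||b)&&c) as (a || (b && c))</td><td>22/525</td><td>15.4%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 5</td><td>Off-by-one</td><td>92/525</td><td>45.6%</td></tr>
<tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 1</td><td>Off-by-one</td><td>48/470</td><td>0.6%</td></tr>
<tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 2</td><td>Subclause-missing</td><td>88/470</td><td>0.2%</td></tr>
<tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 1</td><td>Subclause-missing</td><td>65/217</td><td>0.5%</td></tr>
<tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 2</td><td>Subclause-missing</td><td>17/217</td><td>2.9%</td></tr>
</table>
Software-artifact Infrastructure Repository (SIR): http://sir.unl.edu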
28.
A Close Look: Grep-2.2: Bug 1 <ul><li>11,826 lines of C code </li></ul><ul><li>3,136 predicates instrumented </li></ul><ul><li>48 out of 470 cases fail </li></ul>
29.
Grep-2.2: Bug 2 <ul><li>11,826 lines of C code </li></ul><ul><li>3,136 predicates instrumented </li></ul><ul><li>88 out of 470 cases fail </li></ul>
30.
No Silver Bullet: Flex Bug 5 No wrong value in chk[offset -1] chk[offset] is not used here but later <ul><li>8,834 lines of C code </li></ul><ul><li>2,699 predicates instrumented </li></ul>
31.
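Experiment Result in Summary
<table>
<tr><th>Program</th><th>Bug</th><th>Bug Type</th><th>Failure Number</th><th>T-Score</th></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 1</td><td>Misuse >= for ></td><td>163/525</td><td>0.5%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 2</td><td>Misuse of = for ==</td><td>356/525</td><td>1.6%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 3</td><td>Mis-assign value true for false</td><td>69/525</td><td>7.6%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 4</td><td>Mis-parenthesize ((a||b)&&c) as (a || (b && c))</td><td>22/525</td><td>15.4%</td></tr>
<tr><td>Flex 2.4.7 (8,834 LOC)</td><td>Bug 5</td><td>Off-by-one</td><td>92/525</td><td>45.6%</td></tr>
<tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 1</td><td>Off-by-one</td><td>48/470</td><td>0.6%</td></tr>
<tr><td>Grep 2.2 (11,826 LOC)</td><td>Bug 2</td><td>Subclause-missing</td><td>88/470</td><td>0.2%</td></tr>
<tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 1</td><td>Subclause-missing</td><td>65/217</td><td>0.5%</td></tr>
<tr><td>Gzip 1.2 (6,184 LOC)</td><td>Bug 2</td><td>Subclause-missing</td><td>17/217</td><td>2.9%</td></tr>
</table>
Effective for bugs demonstrating abnormal control flows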
32.
SOBER Handles Memory Bugs As Well <ul><li>bc 1.06: </li></ul><ul><li>Two memory bugs found with SOBER </li></ul><ul><li>One of them was previously unreported </li></ul><ul><li>Blamed location is NOT the crash site </li></ul>
33.
Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
34.
Major Problems in Failure Triage <ul><li>Failure Prioritization </li></ul><ul><ul><li>What failures are likely due to the same bug </li></ul></ul><ul><ul><li>What bugs are the most severe </li></ul></ul><ul><ul><li>Worst 1% bugs = 50% failures </li></ul></ul><ul><li>Failure Assignment </li></ul><ul><ul><li>Which developer should debug which set of failures? </li></ul></ul>Courtesy of Microsoft Corporation
35.
A Solution: Failure Clustering <ul><li>Failure indexing </li></ul><ul><ul><li>Identify failures likely due to the same bug </li></ul></ul>[Figure: failure reports plotted as points on X-Y axes and grouped into clusters labeled most severe, less severe, and least severe; candidate diagnoses shown are "Fault in core.io?" and "Fault in function initialize()?"]
36.
The Central Question: A Distance Measure between Failures <ul><li>Different measures render different clusterings </li></ul>[Figure: the same failure points on X-Y axes, clustered once with distance defined on the X-axis and once with distance defined on the Y-axis]
37.
How to Define a Distance <ul><li>Previous work [Podgurski et al., 2003] </li></ul><ul><ul><li>T-Proximity: Distance defined on literal trace similarity </li></ul></ul><ul><li>Our approach [Liu et al., 2006] </li></ul><ul><ul><li>R-Proximity: Distance defined on likely bug location, as reported by SOBER </li></ul></ul>
38.
Why Our Approach is Reasonable <ul><li>Optimal proximity: defined on root causes (RC) </li></ul><ul><li>Our approach: defined on likely causes (LC), obtained by automated fault localization </li></ul>[Figure: failure points on X-Y axes, with labels F and P]
39.
R-Proximity: An Instantiation with SOBER <ul><li>Likely causes (LCs) are predicate rankings </li></ul><ul><li>A distance between rankings is needed </li></ul>[Figure: SOBER maps each failure point to a predicate ranking, such as (Pred 2, Pred 3, Pred 1, Pred 6) or (Pred 2, Pred 6, Pred 1, Pred 3)]
40.
Distance between Rankings <ul><li>Traditional Kendall’s tau distance </li></ul><ul><ul><li>Number of preference disagreements </li></ul></ul><ul><ul><li>E.g. </li></ul></ul><ul><li>NOT all predicates need to be considered? </li></ul><ul><ul><li>Predicates are uniformly instrumented </li></ul></ul><ul><ul><li>Only fault-relevant predicates count </li></ul></ul>
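The "number of preference disagreements" can be made concrete with a short sketch. The code below counts the pairs of predicates that two rankings order differently, which is the usual Kendall tau distance. The function name and the two sample rankings are illustrative choices, not taken from the slides' elided example.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the pairs of items that the two rankings order differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if pos_a.keys() != pos_b.keys():
        raise ValueError("rankings must contain the same items")
    disagreements = 0
    for x, y in combinations(ranking_a, 2):
        # x precedes y in ranking_a; the pair disagrees if y precedes x in ranking_b.
        if pos_b[x] > pos_b[y]:
            disagreements += 1
    return disagreements


# Illustrative rankings only.
r1 = ["Pred 2", "Pred 6", "Pred 1", "Pred 3"]
r2 = ["Pred 2", "Pred 1", "Pred 3", "Pred 6"]
distance = kendall_tau_distance(r1, r2)
```

Here the two rankings disagree on the pairs (Pred 6, Pred 1) and (Pred 6, Pred 3), so the distance is 2. Every pair counts equally in this plain version, which is what motivates restricting attention to fault-relevant predicates.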
41.
Predicate Weighting in a Nutshell <ul><li>Fault-relevant predicates receive higher weights </li></ul><ul><li>Fault-relevance is implied by rankings </li></ul><ul><li>Most-favored predicates receive higher weights </li></ul>[Figure: four predicate rankings, two of (Pred 2, Pred 6, Pred 1, Pred 3) and two of (Pred 2, Pred 1, Pred 3, Pred 6)]
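One way to picture the weighting idea is the sketch below. The slides do not give the actual weighting formula, so everything here is an assumption made for illustration: a predicate's weight is its average reciprocal rank over all rankings, and a disagreeing pair contributes the product of its two weights instead of 1.

```python
from itertools import combinations


def predicate_weights(rankings):
    """Hypothetical scheme: average reciprocal rank of each predicate."""
    totals = {}
    for ranking in rankings:
        for position, pred in enumerate(ranking):
            totals[pred] = totals.get(pred, 0.0) + 1.0 / (position + 1)
    return {pred: total / len(rankings) for pred, total in totals.items()}


def weighted_disagreement(ranking_a, ranking_b, weights):
    """Sum weight products over the pairs the two rankings order differently."""
    pos_b = {pred: i for i, pred in enumerate(ranking_b)}
    total = 0.0
    for x, y in combinations(ranking_a, 2):
        if pos_b[x] > pos_b[y]:
            total += weights[x] * weights[y]
    return total


# Illustrative rankings only.
r1 = ["Pred 2", "Pred 6", "Pred 1", "Pred 3"]
r2 = ["Pred 2", "Pred 1", "Pred 3", "Pred 6"]
weights = predicate_weights([r1, r2])
score = weighted_disagreement(r1, r2, weights)
```

Under this assumed scheme, Pred 2 is ranked first everywhere and gets the largest weight, so disagreements that involve only low-ranked predicates contribute little to the distance.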
42.
Automated Failure Assignment <ul><li>Most-favored predicates indicate the agreed bug location for a group of failures </li></ul><ul><li>Predicate spectrum graph </li></ul>[Figure: predicate spectrum graph plotted against predicate index, built from rankings such as (Pred 2, Pred 6, Pred 1, Pred 3) and (Pred 2, Pred 1, Pred 3, Pred 6)]
43.
Case Study 1: Grep-2.2 <ul><li>470 test cases in total </li></ul><ul><li>136 cases fail due to both faults, no crashes </li></ul><ul><li>48 fail due to Fault 1, 88 fail due to Fault 2 </li></ul>
44.
Failure Proximity Graphs <ul><li>Red crosses are failures due to Fault 1 </li></ul><ul><li>Blue circles are failures due to Fault 2 </li></ul><ul><li>Divergent behaviors due to the same fault </li></ul><ul><li>Better clustering result under R-Proximity </li></ul>[Figure: two failure proximity graphs, one under T-Proximity and one under R-Proximity]
45.
Guided Failure Assignment <ul><li>What predicates are favored in each group? </li></ul>
46.
Assign Failures to Appropriate Developers <ul><li>The 21 failing cases in Cluster 1 are assigned to developers responsible for the function grep </li></ul><ul><li>The 112 failing cases in Cluster 2 are assigned to developers responsible for the function comsub </li></ul>
47.
Case Study 2: Gzip-1.2.3 <ul><li>217 test cases in total </li></ul><ul><li>82 cases fail due to both faults, no crashes </li></ul><ul><li>65 fail due to Fault 1, 17 fail due to Fault 2 </li></ul>
48.
Failure Proximity Graphs <ul><li>Red crosses are failures due to Fault 1 </li></ul><ul><li>Blue circles are failures due to Fault 2 </li></ul><ul><li>Nearly perfect clustering under R-Proximity </li></ul><ul><li>Accurate failure assignment </li></ul>[Figure: two failure proximity graphs, one under T-Proximity and one under R-Proximity]
49.
Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
50.
Mining Copy-Paste Bugs <ul><li>Copy-pasting is common </li></ul><ul><ul><li>12% in Linux file system [Kasper2003] </li></ul></ul><ul><ul><li>19% in X Window system [Baker1995] </li></ul></ul><ul><li>Copy-pasted code is error prone </li></ul><ul><ul><li>Among 35 errors in Linux drivers/i2o, 34 are caused by copy-paste [Chou2001] </li></ul></ul>
void __init prom_meminit(void)
{
    ……
    for (i=0; i<n; i++) {
        total[i].adr = list[i].addr;
        total[i].bytes = list[i].size;
        total[i].more = &total[i+1];
    }
    ……
    for (i=0; i<n; i++) {
        taken[i].adr = list[i].addr;
        taken[i].bytes = list[i].size;
        taken[i].more = &total[i+1];   /* Forgot to change: total was not renamed to taken */
    }
}
(Simplified example from linux-2.6.6/arch/sparc/prom/memory.c)
51.
An Overview of Copy-Paste Bug Detection <ul><li>Parse source code & build a sequence database </li></ul><ul><li>Mine for basic copy-pasted segments </li></ul><ul><li>Compose larger copy-pasted segments </li></ul><ul><li>Prune false positives </li></ul>
52.
Parsing Source Code <ul><li>Purpose: building a sequence database </li></ul><ul><li>Idea: map each statement to a number </li></ul><ul><ul><li>Tokenize each component </li></ul></ul><ul><ul><li>Different operators/constants/keywords → different tokens </li></ul></ul><ul><li>Handle identifier renaming: </li></ul><ul><ul><li>Same type of identifiers → same token </li></ul></ul>Example: old = 3; → tokenize → (5, 61, 20) → hash → 16, and new = 3; → tokenize → (5, 61, 20) → hash → 16
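The tokenize-then-hash step can be sketched as follows. This is a minimal illustration under stated assumptions, not CP-Miner's implementation: the regular expression, the small keyword set, the single `ID` token for all identifiers, and the use of CRC32 as the hash are all hypothetical choices. The slide's token numbers (5, 61, 20) and hash value 16 are not reproduced.

```python
import re
import zlib

# Assumed, deliberately small keyword set; a real tool would use the full C list.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||\+\+|--|->|[^\sA-Za-z_0-9]"
)


def tokenize(statement):
    """Split a statement into tokens, collapsing every identifier to 'ID'."""
    tokens = []
    for lexeme in TOKEN_RE.findall(statement):
        if IDENT_RE.fullmatch(lexeme) and lexeme not in C_KEYWORDS:
            tokens.append("ID")    # renamed identifiers map to the same token
        else:
            tokens.append(lexeme)  # keywords, constants, operators stay distinct
    return tokens


def statement_hash(statement):
    """Hash the token sequence of one statement to a single number."""
    return zlib.crc32(" ".join(tokenize(statement)).encode("utf-8"))


same = statement_hash("old = 3;") == statement_hash("new = 3;")
```

With this sketch, `old = 3;` and `new = 3;` receive the same number, while `old = 4;` does not, because constants are kept as distinct tokens. The sketch treats all identifiers alike, whereas the slide distinguishes identifiers by type.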
53.
Building Sequence Database <ul><li>Program → a long sequence </li></ul><ul><ul><li>Need a sequence database </li></ul></ul><ul><li>Cut the long sequence </li></ul><ul><ul><li>Naïve method: fixed length </li></ul></ul><ul><ul><li>Our method: basic block </li></ul></ul>
for (i=0; i<n; i++) {                  /* hash value 65 */
    total[i].adr = list[i].addr;       /* hash value 16 */
    total[i].bytes = list[i].size;     /* hash value 16 */
    total[i].more = &total[i+1];       /* hash value 71 */
}
……
for (i=0; i<n; i++) {                  /* hash value 65 */
    taken[i].adr = list[i].addr;       /* hash value 16 */
    taken[i].bytes = list[i].size;     /* hash value 16 */
    taken[i].more = &total[i+1];       /* hash value 71 */
}
Final sequence DB: (65) (16, 16, 71) … (65) (16, 16, 71)
54.
Mining for Basic Copy-pasted Segments <ul><li>Apply frequent sequence mining algorithm on the sequence database </li></ul><ul><li>Modification </li></ul><ul><ul><li>Constrain the max gap </li></ul></ul>
total[i].adr = list[i].addr; total[i].bytes = list[i].size; total[i].more = &total[i+1];   → (16, 16, 71)
taken[i].adr = list[i].addr; taken[i].bytes = list[i].size; taken[i].more = &total[i+1];   → (16, 16, 71)
Frequent subsequence: (16, 16, 71) …… (16, 16, 71)
Insert 1 statement (gap = 1): (16, 16, 71) …… (16, 16, 10, 71)
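The max-gap constraint can be illustrated with a small matching routine. This sketch only tests whether one candidate pattern occurs in a sequence with a bounded gap and counts its support; it is not the frequent sequence mining algorithm itself, and the function names are hypothetical.

```python
def occurs_with_gap(pattern, sequence, max_gap):
    """True if pattern is a subsequence of sequence with at most max_gap
    unmatched items between consecutive matched items."""

    def match_from(p_idx, start, limit):
        if p_idx == len(pattern):
            return True
        # The first pattern item may start anywhere (limit is None);
        # later items must appear within limit skipped positions.
        end = len(sequence) if limit is None else min(len(sequence), start + limit + 1)
        for pos in range(start, end):
            if sequence[pos] == pattern[p_idx] and match_from(p_idx + 1, pos + 1, max_gap):
                return True
        return False

    return match_from(0, 0, None)


def support(pattern, database, max_gap):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database if occurs_with_gap(pattern, seq, max_gap))


# Sequence database modeled on the slide: one copy has an inserted statement (hash 10).
database = [(65,), (16, 16, 71), (65,), (16, 16, 10, 71)]
with_gap = support((16, 16, 71), database, max_gap=1)
without_gap = support((16, 16, 71), database, max_gap=0)
```

With `max_gap=1` the pattern (16, 16, 71) is found in both copies, including the one with an inserted statement; with `max_gap=0` only the exact copy matches.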
56.
Pruning False Positives <ul><li>Unmappable segments </li></ul><ul><ul><li>Identifier names cannot be mapped to corresponding ones </li></ul></ul><ul><ul><li>Example: f(a1); f(a2); f(a3); versus f1(b1); f1(b2); f2(b3); is a conflict, because f would have to map to both f1 and f2 </li></ul></ul><ul><li>Tiny segments </li></ul><ul><li>For more detail, see </li></ul><ul><ul><li>Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code, in Proc. 6th Symp. Operating Systems Design and Implementation, 2004 </li></ul></ul>
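The "unmappable segment" test can be sketched as a consistency check on identifier pairs. This is an illustration under assumptions, not CP-Miner's pruning code: each segment is represented as a flat list of identifier names in order of appearance, and the function name is hypothetical.

```python
def identifier_mapping(segment_a, segment_b):
    """Map identifiers of segment_a to those of segment_b, position by position.

    Returns the mapping, or None when an identifier would have to map to two
    different names (a conflict) or the segments differ in length.
    """
    if len(segment_a) != len(segment_b):
        return None
    mapping = {}
    for name_a, name_b in zip(segment_a, segment_b):
        if mapping.setdefault(name_a, name_b) != name_b:
            return None  # name_a was already mapped to a different name
    return mapping


# Identifier sequences from the slide's example.
original = ["f", "a1", "f", "a2", "f", "a3"]
candidate = ["f1", "b1", "f1", "b2", "f2", "b3"]
result = identifier_mapping(original, candidate)
```

For the slide's example the check fails on the third call, since `f` is already mapped to `f1` when `f2` appears, so the pair would be pruned as unmappable.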
57.
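Some Test Results of C-P Bug Detection
<table>
<tr><th>Software</th><th># LOC</th><th>Verified Bugs</th><th>Potential Bugs (careless programming)</th><th>Time</th><th>Space (MB)</th></tr>
<tr><td>Linux</td><td>4.4 M</td><td>28</td><td>21</td><td>20 mins</td><td>527</td></tr>
<tr><td>FreeBSD</td><td>3.3 M</td><td>23</td><td>8</td><td>20 mins</td><td>459</td></tr>
<tr><td>Apache</td><td>224 K</td><td>5</td><td>0</td><td>15 secs</td><td>30</td></tr>
<tr><td>PostgreSQL</td><td>458 K</td><td>2</td><td>0</td><td>38 secs</td><td>57</td></tr>
</table>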
58.
Outline <ul><li>Automated Debugging and Failure Triage </li></ul><ul><li>SOBER: Statistical Model-Based Fault Localization </li></ul><ul><li>Fault Localization-Based Failure Triage </li></ul><ul><li>Copy and Paste Bug Mining </li></ul><ul><li>Conclusions & Future Research </li></ul>
59.
Conclusions <ul><li>Data mining into software and computer systems </li></ul><ul><li>Identify incorrect executions from program runtime behaviors </li></ul><ul><li>Classification dynamics can give away “backtrace” for noncrashing bugs without any semantic inputs </li></ul><ul><li>A hypothesis testing-like approach is developed to localize logic bugs in software </li></ul><ul><li>No prior knowledge about the program semantics is assumed </li></ul><ul><li>Many other software bug mining methods remain to be explored </li></ul>
60.
Future Research: Mining into Computer Systems <ul><li>Huge volume of data from computer systems </li></ul><ul><ul><li>Persistent state interactions, event logs, network logs, CPU usage, … </li></ul></ul><ul><li>Mining system data for … </li></ul><ul><ul><li>Reliability </li></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>Manageability </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><li>Challenges in data mining </li></ul><ul><ul><li>Statistical modeling of computer systems </li></ul></ul><ul><ul><li>Online, scalability, interpretability … </li></ul></ul>
61.
References <ul><li>[DRL+98] David L. Detlefs, K. Rustan M. Leino, Greg Nelson, and James B. Saxe. Extended static checking, 1998. </li></ul><ul><li>[EGH+94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint: A tool for using specifications to check code. In Proceedings of the ACM SIGSOFT '94 Symposium on the Foundations of Software Engineering, pages 87-96, 1994. </li></ul><ul><li>[DLS02] Manuvir Das, Sorin Lerner, and Mark Seigle. ESP: Path-sensitive program verification in polynomial time. In Conference on Programming Language Design and Implementation, 2002. </li></ul><ul><li>[ECC00] D. R. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th Symp. Operating Systems Design and Implementation, October 2000. </li></ul><ul><li>[M93] Ken McMillan. Symbolic Model Checking. Kluwer Academic Publishers, 1993. </li></ul><ul><li>[H97] Gerard J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5):279-295, 1997. </li></ul><ul><li>[DDH+92] David L. Dill, Andreas J. Drexler, Alan J. Hu, and C. Han Yang. Protocol verification as a hardware design aid. In IEEE Int. Conf. Computer Design: VLSI in Computers and Processors, pages 522-525, 1992. </li></ul><ul><li>[MPC+02] M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A Pragmatic Approach to Model Checking Real Code. In Proc. 5th Symp. Operating Systems Design and Implementation, 2002. </li></ul>
62.
References (cont’d) <ul><li>[G97] P. Godefroid. Model Checking for Programming Languages using VeriSoft. In Proc. 24th ACM Symp. Principles of Programming Languages, 1997. </li></ul><ul><li>[BHP+-00] G. Brat, K. Havelund, S. Park, and W. Visser. Model checking programs. In IEEE Int'l Conf. Automated Software Engineering (ASE), 2000. </li></ul><ul><li>[HJ92] R. Hastings and B. Joyce. Purify: Fast Detection of Memory Leaks and Access Errors. 1991. In Proc. Winter 1992 USENIX Conference, pp. 125-138, San Francisco, California. </li></ul><ul><li>Chao Liu, Xifeng Yan, and Jiawei Han, “Mining Control Flow Abnormality for Logic Error Isolation,” in Proc. 2006 SIAM Int. Conf. on Data Mining (SDM'06), Bethesda, MD, April 2006. </li></ul><ul><li>C. Liu, X. Yan, L. Fei, J. Han, and S. Midkiff, “SOBER: Statistical Model-based Bug Localization,” in Proc. 2005 ACM SIGSOFT Symp. Foundations of Software Engineering (FSE 2005), Lisbon, Portugal, Sept. 2005. </li></ul><ul><li>C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for Backtrace of Noncrashing Bugs,” in Proc. 2005 SIAM Int. Conf. on Data Mining (SDM'05), Newport Beach, CA, April 2005. </li></ul><ul><li>[SN00] Julian Seward and Nick Nethercote. Valgrind, an open-source memory debugger for x86-GNU/Linux. http://valgrind.org/ </li></ul><ul><li>[LLM+04] Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004. </li></ul><ul><li>[LCS+04] Zhenmin Li, Zhifeng Chen, Sudarshan M. Srinivasan, Yuanyuan Zhou. C-Miner: Mining Block Correlations in Storage Systems. In Proc. 3rd USENIX Conf. on File and Storage Technologies, 2004. </li></ul>
64.
Surplus Slides <ul><li>The remaining are leftover slides </li></ul>
65.
Representative Publications <ul><li>Chao Liu, Long Fei, Xifeng Yan, Jiawei Han and Samuel Midkiff, “Statistical Debugging: A Hypothesis Testing-Based Approach,” IEEE Transactions on Software Engineering, Vol. 32, No. 10, pp. 831-848, Oct. 2006. </li></ul><ul><li>Chao Liu and Jiawei Han, “R-Proximity: Failure Proximity Defined via Statistical Debugging,” IEEE Transactions on Software Engineering, Sept. 2006. (under review) </li></ul><ul><li>Chao Liu, Zeng Lian and Jiawei Han, "How Bayesians Debug", the 6th IEEE International Conference on Data Mining, pp. 382-393, Hong Kong, China, Dec. 2006. </li></ul><ul><li>Chao Liu and Jiawei Han, "Failure Proximity: A Fault Localization-Based Approach", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Portland, USA, Nov. 2006. </li></ul><ul><li>Chao Liu, "Fault-aware Fingerprinting: Towards Mutualism between Failure Investigation and Statistical Debugging", the 14th ACM SIGSOFT Symposium on the Foundations of Software Engineering, Portland, USA, Nov. 2006. </li></ul><ul><li>Chao Liu, Chen Chen, Jiawei Han and Philip S. Yu, "GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis", the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 872-881, Philadelphia, USA, Aug. 2006. </li></ul><ul><li>Qiaozhu Mei, Chao Liu, Hang Su and Chengxiang Zhai, "A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs", the 15th International Conference on World Wide Web, pp. 533-542, Edinburgh, Scotland, May 2006. </li></ul><ul><li>Chao Liu, Xifeng Yan and Jiawei Han, "Mining Control Flow Abnormality for Logic Error Isolation", 2006 SIAM International Conference on Data Mining, pp. 106-117, Bethesda, US, April 2006. </li></ul><ul><li>Chao Liu, Xifeng Yan, Long Fei, Jiawei Han and Samuel Midkiff, "SOBER: Statistical Model-Based Bug Localization", the 5th joint meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 286-295, Lisbon, Portugal, Sept. 2005. </li></ul><ul><li>William Yurcik and Chao Liu, "A First Step Toward Detecting SSH Identity Theft on HPC Clusters: Discriminating Cluster Masqueraders Based on Command Behavior", the 5th International Symposium on Cluster Computing and the Grid, pp. 111-120, Cardiff, UK, May 2005. </li></ul><ul><li>Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han and Philip S. Yu, "Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs", in Proc. 2005 SIAM Int. Conf. on Data Mining, pp. 286-297, Newport Beach, US, April 2005. </li></ul>
66.
Example of Noncrashing Bugs
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m > 0) {
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}
The slide shows this function with three variants of the first condition: if (m > 0), if (m >= 0), and if ((m >= 0) && (lastm != m)).
68.
Bug Localization via Backtrace <ul><li>Can we identify the backtrace for noncrashing bugs? </li></ul><ul><li>Major challenges </li></ul><ul><ul><li>We do not know where the abnormality happens </li></ul></ul><ul><li>Observations </li></ul><ul><ul><li>Classification depends on discriminative features, which can be regarded as a kind of abnormality </li></ul></ul><ul><ul><li>Can we extract the backtrace from classification results? </li></ul></ul>
69.
Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
70.
Related Work <ul><li>Crashing bugs </li></ul><ul><ul><li>Memory access monitoring </li></ul></ul><ul><ul><ul><li>Purify [HJ92], Valgrind [SN00] … </li></ul></ul></ul><ul><li>Noncrashing bugs </li></ul><ul><ul><li>Static program analysis </li></ul></ul><ul><ul><li>Traditional model checking </li></ul></ul><ul><ul><li>Model checking source code </li></ul></ul>
71.
Static Program Analysis <ul><li>Methodology </li></ul><ul><ul><li>Examine source code directly </li></ul></ul><ul><ul><li>Enumerate all the possible execution paths without running the program </li></ul></ul><ul><ul><li>Check user-specified properties, e.g. </li></ul></ul><ul><ul><ul><li>free(p) …… (*p) </li></ul></ul></ul><ul><ul><ul><li>lock(res) …… unlock(res) </li></ul></ul></ul><ul><ul><ul><li>receive_ack() … … send_data() </li></ul></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Check all possible execution paths </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Shallow semantics </li></ul></ul><ul><ul><li>Properties must map directly to source code structure </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>ESC [DRL+98], LCLint [EGH+94], ESP [DLS02], MC Checker [ECC00] … </li></ul></ul>
72.
Traditional Model Checking <ul><li>Methodology </li></ul><ul><ul><li>Formally model the system under check in a particular description language </li></ul></ul><ul><ul><li>Exhaustively explore the reachable states to check desired or undesired properties </li></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Model deep semantics </li></ul></ul><ul><ul><li>Naturally fits event-driven systems, like protocols </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Significant manual effort in modeling </li></ul></ul><ul><ul><li>State space explosion </li></ul></ul><ul><li>Tools </li></ul><ul><ul><li>SMV [M93], SPIN [H97], Murphi [DDH+92] … </li></ul></ul>
73.
Model Checking Source Code <ul><li>Methodology </li></ul><ul><ul><li>Run the real program in a sandbox </li></ul></ul><ul><ul><li>Manipulate event occurrences, e.g., </li></ul></ul><ul><ul><ul><li>Incoming messages </li></ul></ul></ul><ul><ul><ul><li>The outcomes of memory allocation </li></ul></ul></ul><ul><li>Strengths </li></ul><ul><ul><li>Less manual specification needed </li></ul></ul><ul><li>Problems </li></ul><ul><ul><li>Application restrictions, e.g., </li></ul></ul><ul><ul><ul><li>Event-driven programs (still) </li></ul></ul></ul><ul><ul><ul><li>Clear mapping between source code and logic event </li></ul></ul></ul><ul><li>Tools </li></ul><ul><ul><li>CMC [MPC+02], VeriSoft [G97], Java PathFinder [BHP+-00] … </li></ul></ul>
74.
Summary of Related Work <ul><li>In common, </li></ul><ul><ul><li>Semantic inputs are necessary </li></ul></ul><ul><ul><ul><li>Program model </li></ul></ul></ul><ul><ul><ul><li>Properties to check </li></ul></ul></ul><ul><ul><li>Application scenarios </li></ul></ul><ul><ul><ul><li>Shallow semantics </li></ul></ul></ul><ul><ul><ul><li>Event-driven system </li></ul></ul></ul>
75.
Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
76.
Example Revisited <ul><li>No memory violations </li></ul><ul><li>Not event-driven program </li></ul><ul><li>No explicit error properties </li></ul>
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m > 0) {
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}
The slide shows this function with three variants of the first condition: if (m > 0), if (m >= 0), and if ((m >= 0) && (lastm != m)).
77.
Identification of Incorrect Executions <ul><li>A two-class classification problem </li></ul><ul><ul><li>How to abstract program executions </li></ul></ul><ul><ul><ul><li>Program behavior graph </li></ul></ul></ul><ul><ul><li>Feature selection </li></ul></ul><ul><ul><ul><li>Edges + Closed frequent subgraphs </li></ul></ul></ul><ul><li>Program behavior graphs </li></ul><ul><ul><li>Function-level abstraction of program behaviors </li></ul></ul>
int main() { ... A(); ... B(); }
int A() { ... }
int B() { ... C(); ... }
int C() { ... }
78.
Values of Classification <ul><li>A graph classification problem </li></ul><ul><ul><li>Every execution gives one behavior graph </li></ul></ul><ul><ul><li>Two sets of instances: correct and incorrect </li></ul></ul><ul><li>Values of classification </li></ul><ul><ul><li>Classification itself does not readily work for bug localization </li></ul></ul><ul><ul><ul><li>The classifier only labels each run as a whole as either correct or incorrect </li></ul></ul></ul><ul><ul><ul><li>It does not tell when the abnormality happens </li></ul></ul></ul><ul><ul><li>Successful classification relies on discriminative features </li></ul></ul><ul><ul><ul><li>Can discriminative features be treated as a kind of abnormality? </li></ul></ul></ul><ul><ul><li>When does the abnormality happen? </li></ul></ul><ul><ul><ul><li>Incremental classification? </li></ul></ul></ul>
79.
Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
80.
Incremental Classification <ul><li>Classification works only when instances of the two classes are different </li></ul><ul><li>So we can use classification accuracy as a measure of difference </li></ul><ul><li>Relate classification dynamics to bug-relevant functions </li></ul>
81.
Illustration: Precision Boost [Figure: behavior graphs of one correct execution and one incorrect execution, with nodes main and A through H]
82.
Bug Relevance <ul><li>Precision boost </li></ul><ul><ul><li>For each function F: </li></ul></ul><ul><ul><ul><li>Precision boost = Exit precision - Entrance precision </li></ul></ul></ul><ul><ul><li>Intuition </li></ul></ul><ul><ul><ul><li>Differences take place within the execution of F </li></ul></ul></ul><ul><ul><ul><li>Abnormalities happen while F is on the stack </li></ul></ul></ul><ul><ul><ul><li>The larger this precision boost, the more likely F is part of the backtrace </li></ul></ul></ul><ul><li>Bug-relevant function </li></ul>
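The precision boost arithmetic can be written out as a small sketch. The function names and the precision numbers below are made up for illustration; the slides give no measured values here, and the classifier that produces the entrance and exit precisions is assumed to exist elsewhere.

```python
def precision_boost(entrance_precision, exit_precision):
    """Precision boost = exit precision - entrance precision."""
    return exit_precision - entrance_precision


def rank_by_boost(checkpoints):
    """Rank functions by precision boost, largest first.

    checkpoints maps a function name to (entrance_precision, exit_precision).
    """
    boosts = {
        name: precision_boost(entrance, exit_)
        for name, (entrance, exit_) in checkpoints.items()
    }
    return sorted(boosts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical classification precisions at function entry and exit.
checkpoints = {
    "main": (0.5, 0.875),
    "A": (0.5, 0.5),
    "B": (0.5, 0.75),
}
ranking = rank_by_boost(checkpoints)
```

In this made-up example `main` and `B` gain precision between entry and exit while `A` does not, so `main` and `B` would be the candidates for the backtrace.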
83.
Outline <ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Case Study </li></ul><ul><li>Conclusions </li></ul>
84.
Case Study <ul><li>Subject program </li></ul><ul><ul><li>replace: perform regular expression matching and substitutions </li></ul></ul><ul><ul><li>563 lines of C code </li></ul></ul><ul><ul><li>17 functions are involved </li></ul></ul><ul><li>Execution behaviors </li></ul><ul><ul><li>130 out of 5542 test cases fail to give correct outputs </li></ul></ul><ul><ul><li>No incorrect executions incur segmentation faults </li></ul></ul><ul><li>Logic bug </li></ul><ul><ul><li>Can we identify the backtrace for this bug? </li></ul></ul>
void subline(char *lin, char *pat, char *sub)
{
    int i, lastm, m;
    lastm = -1;
    i = 0;
    while ((lin[i] != ENDSTR)) {
        m = amatch(lin, i, pat, 0);
        if (m >= 0) {
            putsub(lin, i, m, sub);
            lastm = m;
        }
        if ((m == -1) || (m == i)) {
            fputc(lin[i], stdout);
            i = i + 1;
        } else
            i = m;
    }
}
The slide shows this function twice; the second copy differs only in the first condition, which reads if ((m >= 0) && (lastm != m)).
88.
Method Summary <ul><li>Identify incorrect executions from program runtime behaviors </li></ul><ul><li>Classification dynamics can give away “backtrace” for noncrashing bugs without any semantic inputs </li></ul><ul><li>Data mining can contribute to software engineering and systems research in general </li></ul>
89.
Outline <ul><li>Motivation </li></ul><ul><li>Related Work </li></ul><ul><li>Classification of Program Executions </li></ul><ul><li>Extract “Backtrace” from Classification Dynamics </li></ul><ul><li>Mining Control Flow Abnormality for Logic Error Isolation </li></ul><ul><li>CP-Miner: Mining Copy-Paste Bugs </li></ul><ul><li>Conclusions </li></ul>
90.
An Example <ul><li>Replace program: 563 lines of C code, 20 functions </li></ul><ul><li>Symptom: 30 out of 5542 test cases fail to give correct outputs, and no crashes </li></ul><ul><li>Goal: Localizing the bug, and prioritizing manual examination </li></ul>
void dodash(char delim, char *src, int *i, char *dest, int *j, int maxset)
{
    while (…) {
        …
        if (isalnum(src[*i+1]) && src[*i-1] <= src[*i+1]) {
            for (k = src[*i-1]+1; k <= src[*i+1]; k++)
                junk = addst(k, dest, j, maxset);
            *i = *i + 1;
        }
        *i = *i + 1;
    }
}
91.
Difficulty & Expectation <ul><li>Difficulty </li></ul><ul><ul><li>Statically, even small programs are complex due to dependencies </li></ul></ul><ul><ul><li>Dynamically, execution paths can vary significantly across all possible inputs </li></ul></ul><ul><ul><li>Logic errors have no apparent symptoms </li></ul></ul><ul><li>Expectations </li></ul><ul><ul><li>Unrealistic to fully relieve developers of debugging </li></ul></ul><ul><ul><li>Localize buggy region </li></ul></ul><ul><ul><li>Prioritize manual examination </li></ul></ul>
92.
Execution Profiling <ul><li>Full execution trace </li></ul><ul><ul><li>Control flow + value tags </li></ul></ul><ul><ul><li>Too expensive to record at runtime </li></ul></ul><ul><ul><li>Unwieldy to process </li></ul></ul><ul><li>Summarized control flow for conditionals (if, while, for) </li></ul><ul><ul><li>Branch evaluation counts </li></ul></ul><ul><ul><li>Lightweight to take at runtime </li></ul></ul><ul><ul><li>Easy to process and effective </li></ul></ul>
93.
Analysis of the Example <ul><ul><li>A = isalnum(src[*i+1]) </li></ul></ul><ul><ul><li>B = src[*i-1] <= src[*i+1] </li></ul></ul><ul><li>An execution stays logically correct until (A ∧ ¬B) evaluates to true when control reaches this condition </li></ul><ul><li>If we monitor program conditionals like A here, their evaluations shed light on the hidden error and can be exploited for error isolation </li></ul>
if (isalnum(src[*i+1]) && src[*i-1] <= src[*i+1]) {
    for (k = src[*i-1]+1; k <= src[*i+1]; k++)
        junk = addst(k, dest, j, maxset);
    *i = *i + 1;
}
94.
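Analysis of Branching Actions <ul><li>Correct vs. incorrect runs in program P </li></ul><ul><li>Across the 5542 test cases, the probability that A evaluates to true is on average 0.727 in a correct execution and 0.896 in an incorrect execution </li></ul><ul><li>The error location does exhibit detectable abnormal behavior in incorrect executions </li></ul>
<table>
<tr><th>Correct runs</th><th>A</th><th>¬A</th></tr>
<tr><td>B</td><td>n_AB</td><td>n_¬AB</td></tr>
<tr><td>¬B</td><td>n_A¬B = 0</td><td>n_¬A¬B</td></tr>
</table>
<table>
<tr><th>Incorrect runs</th><th>A</th><th>¬A</th></tr>
<tr><td>B</td><td>n_AB</td><td>n_¬AB</td></tr>
<tr><td>¬B</td><td>n_A¬B ≥ 1</td><td>n_¬A¬B</td></tr>
</table>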
95.
Conditional Test Works for Nonbranching Errors <ul><li>An off-by-one error can still be detected using the conditional tests </li></ul>
int makepat(char *arg, int start, char delim, char *pat)
{
    …
    if (!junk)
        result = 0;
    else
        result = i + 1;   /* off-by-one error */
                          /* should be: result = i */
    return result;
}
96.
Ranking Based on Boolean Bias <ul><li>Let input d_i have a desired output o_i. We execute P on d_i and obtain o_i’. P passes the test iff o_i’ is identical to o_i </li></ul><ul><ul><li>T_p = {t_i | o_i’ = P(d_i) matches o_i} </li></ul></ul><ul><ul><li>T_f = {t_i | o_i’ = P(d_i) does not match o_i} </li></ul></ul><ul><li>Boolean bias: </li></ul><ul><ul><li>n_t: number of times that a boolean feature B evaluates true; n_f likewise for false </li></ul></ul><ul><ul><li>Boolean bias: π(B) = (n_t – n_f)/(n_t + n_f) </li></ul></ul><ul><ul><li>It encodes the distribution of B’s value: 1 if B always evaluates true, -1 if always false, and in between for all other mixtures </li></ul></ul>
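The boolean bias formula is simple enough to check by hand with a few lines of code. The sketch below follows the definition on the slide; returning `None` when the feature is never evaluated is an assumption, since the slide does not say how that case is handled.

```python
def boolean_bias(n_true, n_false):
    """Boolean bias pi(B) = (n_t - n_f) / (n_t + n_f) for one execution."""
    total = n_true + n_false
    if total == 0:
        return None  # assumption: bias is undefined when B is never evaluated
    return (n_true - n_false) / total


always_true = boolean_bias(5, 0)
always_false = boolean_bias(0, 5)
mixed = boolean_bias(3, 1)
```

A feature that is true in 3 of 4 evaluations has bias (3 - 1) / 4 = 0.5, between the two extremes of 1 and -1.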
97.
Evaluation Abnormality <ul><li>Boolean bias for branch P </li></ul><ul><ul><li>the probability of being evaluated as true within one execution </li></ul></ul><ul><li>Suppose we have n correct and m incorrect executions, for any predicate P , we end up with </li></ul><ul><ul><li>An observation sequence for correct runs </li></ul></ul><ul><ul><ul><li>S_p = (X’_1, X’_2, …, X’_n) </li></ul></ul></ul><ul><ul><li>An observation sequence for incorrect runs </li></ul></ul><ul><ul><ul><li>S_f = (X_1, X_2, …, X_m) </li></ul></ul></ul><ul><li>Can we infer whether P is suspicious based on S_p and S_f ? </li></ul>
98.
Underlying Populations <ul><li>Imagine the underlying distributions of boolean bias for correct and incorrect executions are f(X|θ_p) and f(X|θ_f) </li></ul><ul><li>S_p and S_f can be viewed as random samples from the respective underlying populations </li></ul><ul><li>Major heuristic: The larger the divergence between f(X|θ_p) and f(X|θ_f), the more relevant the branch P is to the bug </li></ul>[Figure: two probability density plots, each with evaluation bias from 0 to 1 on the horizontal axis and probability on the vertical axis]
99.
Major Challenges <ul><li>No knowledge of the closed forms of both distributions </li></ul><ul><li>Usually, we do not have sufficient incorrect executions to estimate f(X|θ_f) reliably </li></ul>[Figure: two probability density plots, each with evaluation bias from 0 to 1 on the horizontal axis and probability on the vertical axis]
101.
Faulty Functions <ul><li>Motivation </li></ul><ul><ul><li>Bugs are not necessarily on branches </li></ul></ul><ul><ul><li>Higher confidence in function rankings than branch rankings </li></ul></ul><ul><li>Abnormality score for functions </li></ul><ul><ul><li>Calculate the abnormality score for each branch within each function </li></ul></ul><ul><ul><li>Aggregate them </li></ul></ul>
102.
Two Evaluation Measures <ul><li>CombineRank </li></ul><ul><ul><li>Combine the branch scores by summation </li></ul></ul><ul><ul><li>Intuition: When a function contains many abnormal branches, it is likely bug-relevant </li></ul></ul><ul><li>UpperRank </li></ul><ul><ul><li>Choose the largest score as the representative </li></ul></ul><ul><ul><li>Intuition: When a function has one extremely abnormal branch, it is likely bug-relevant </li></ul></ul>
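The two aggregation rules can be compared on a toy input. In the sketch below the per-branch abnormality scores are invented purely for illustration (the function names dodash and omatch come from the replace program, the numbers do not), and each function is assumed to have at least one instrumented branch.

```python
def combine_rank(branch_scores):
    """CombineRank: sum the abnormality scores of a function's branches."""
    return sum(branch_scores)


def upper_rank(branch_scores):
    """UpperRank: take the largest branch score as the representative."""
    return max(branch_scores)


def rank_functions(scores_by_function, measure):
    """Order function names by the chosen measure, most suspicious first."""
    return sorted(
        scores_by_function,
        key=lambda name: measure(scores_by_function[name]),
        reverse=True,
    )


# Hypothetical branch abnormality scores.
scores = {
    "dodash": [0.5, 0.25, 0.25],
    "omatch": [0.75],
}
by_combine = rank_functions(scores, combine_rank)
by_upper = rank_functions(scores, upper_rank)
```

With these made-up scores the two measures disagree: CombineRank puts dodash first because its three branches sum to 1.0, while UpperRank puts omatch first because of its single 0.75 branch. This is the kind of difference the comparison between the two measures is meant to expose.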
103.
Dodash vs. Omatch: Which Function Is Likely Buggy, and Which Measure Is More Effective?
104.
Bug Benchmark <ul><li>Bug benchmark </li></ul><ul><ul><li>Siemens Program Suite </li></ul></ul><ul><ul><ul><li>89 variants of 6 subject programs, each of 200-600 LOC </li></ul></ul></ul><ul><ul><ul><li>89 known bugs in total </li></ul></ul></ul><ul><ul><ul><li>Mainly logic (or semantic) bugs </li></ul></ul></ul><ul><ul><li>Widely used in software engineering research </li></ul></ul>
108.
More Questions to Be Answered <ul><li>What happens, and how should it be handled, if multiple errors exist in one program? </li></ul><ul><li>How to detect bugs if only very few failing test cases are available? </li></ul><ul><li>Is it really more effective if we have more execution traces? </li></ul><ul><li>How to integrate program semantics analysis with this statistics-based analysis? </li></ul>