CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)
1. CrashLocator: Locating Crashing Faults Based on Crash Stacks
Rongxin Wu¹, Hongyu Zhang², Shing-Chi Cheung¹ and Sunghun Kim¹
¹The Hong Kong University of Science and Technology
²Microsoft Research
July 24th, 2014
ISSTA 2014
3. Feedback From Mozilla Developers
• Locating crashing faults is hard
• Ad hoc approach
“… and look at the crash stack listed. It shows the line number
of the code, and then I go to the code and inspect it. If I am
unsure what it does I go to the second line of the stack and
code and inspect that, and so on and so forth …”
“Some crashes are hard to fix because it is not necessarily
indicative of the place where it crashes in the crash stack …”
“ I use the top down method of following the crash backwards.”
“Sometimes it can be very difficult.”
4. Uncertain Fault Location
• The faulty function may not appear in the crash stack
About 33%~41% of crashing faults in Firefox
cannot be located in crash stacks!
[Diagram: call graph of functions A–H; the crash stack and crash point are marked, and the buggy code lies in a function that does not appear on the crash stack.]
5. Related Work
• Tarantula
(J. A. Jones et al., ICSE 2002)
(J. A. Jones et al., ASE 2005)
• Jaccard
(R. Abreu et al., TAICPART-MUTATION 2007)
• Ochiai
(R. Abreu et al., TAICPART-MUTATION 2007)
(S. Artzi et al., ISSTA 2010)
• …
• Spectrum-based fault localization: contrasts passing traces and failing traces
6. Are these techniques applicable?
Spectrum-based fault localization requires failing traces and passing traces collected from instrumented software.
[Diagram: instrumenting production software is impractical due to privacy concerns and performance overhead (C. Luk et al., PLDI 2005); only the crash stack (f1, f2, f3, …, fn) is available as a partial failing trace, and passing traces from test cases may not be effective (S. Artzi et al., ISSTA 2010).]
7. Our Research Goal
How to help developers fix crashing faults?
– Locate crashing faults based on crash stack
8. Our technique: CrashLocator
• Targets faulty functions
• No instrumentation needed
• Approximates failing traces: based on crash stacks, using static analysis techniques
• Ranks suspicious functions: without passing traces, based on characteristics of faulty functions
9. Approximate Failing Traces
• Basic Stack Expansion Algorithm
[Diagram: crash stack A, B, C, D expanded over the static call graph; depth-1 adds E, J, M, N; depth-2 adds F, K, L; depth-3 adds G, H.]
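To make the expansion step concrete, here is a minimal Python sketch, assuming the static call graph is a dictionary from caller to callees and the crash stack is a list of function names (illustrative data shapes and names, not the authors' implementation):

def expand_crash_stack(crash_stack, call_graph, max_depth):
    # Depth 0: the functions already on the crash stack.
    trace = set(crash_stack)
    frontier = set(crash_stack)
    for _ in range(max_depth):
        next_frontier = set()
        for fn in frontier:
            for callee in call_graph.get(fn, []):
                if callee not in trace:
                    trace.add(callee)
                    next_frontier.add(callee)
        frontier = next_frontier
    return trace

# Mirroring the slide: expanding A, B, C, D pulls in E, J, M, N at depth 1,
# then F, K, L at depth 2, and G, H at depth 3.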
10. Approximate Failing Traces
Crash Stack:
  Function   Position   File     Line
  D          0          file_0   l0
  C          1          file_1   l1
  B          2          file_2   l2
  A          3          file_3   l3
• Basic Stack Expansion Algorithm: uses function call information only
• Improved Stack Expansion Algorithm: also uses source file position information
11. Improved Stack Expansion Algorithm
• Control Flow Analysis
[Diagram: CFG of function A (Entry, an if branch, calls to J() and B(), Exit); the call site of B() is the one recorded in the crash stack, so the inferred control flow path excludes the call to J(). The crash stack A, B, C, D and its depth-1/2/3 expansion are shown alongside.]
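A rough sketch of the control-flow filter, assuming the CFG is a successor map, `calls` maps CFG nodes to callee names, and `frame_node` is the call site recorded on the crash stack (all assumed representations, not the authors' implementation); only callees whose call sites lie on some entry-to-frame path are kept:

def _reachable(graph, start):
    # Iterative DFS over a {node: [successors]} map.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

def _invert(graph):
    # Reverse every edge of the CFG.
    inv = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inv.setdefault(dst, []).append(src)
    return inv

def callees_before_crash_frame(cfg, entry, calls, frame_node):
    # Keep only callees whose call sites can both be reached from the
    # function entry and still reach the call site on the crash stack.
    on_path = _reachable(cfg, entry) & _reachable(_invert(cfg), frame_node)
    return {calls[n] for n in on_path if n in calls}

# In the CFG of A above, the call site of B() is on the crash stack and the
# call to J() is not on any entry-to-B path, so J is filtered out.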
12. Improved Stack Expansion Algorithm
• Backward Slicing
1. Obj D(){
2.   Obj s;
3.   int a = M();
4.   char b = '';
5.   Obj[] c = N(b);
6.   s = c[1]; // crash here
7.   if (s != '') {
8.     …
9.   }
Crash-related variables: {s, c}
[Diagram: the expanded crash stack A, B, C, D with depth-1 E, M, N, depth-2 F, and depth-3 G, H; the call from D to M is marked as not in the slice and is filtered out of the expansion.]
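A toy sketch of the slice-based filtering, using a hand-built list of (line, defined variables, used variables, callee) tuples for function D above; real backward slicing works on the program dependence graph, so this only illustrates the filtering rule:

def calls_in_backward_slice(statements, crash_line, crash_vars):
    # Walk backwards from the crash line, keeping statements that define a
    # variable the crash transitively depends on; collect their callees.
    relevant = set(crash_vars)
    kept_calls = set()
    for line, defs, uses, callee in sorted(statements, reverse=True):
        if line > crash_line:
            continue
        if line == crash_line or relevant & set(defs):
            relevant |= set(uses)
            if callee:
                kept_calls.add(callee)
    return kept_calls

stmts = [
    (3, {"a"}, set(), "M"),   # int a = M();
    (4, {"b"}, set(), None),  # char b = '';
    (5, {"c"}, {"b"}, "N"),   # Obj[] c = N(b);
    (6, {"s"}, {"c"}, None),  # s = c[1];  crash line
]
print(calls_in_backward_slice(stmts, 6, {"s", "c"}))  # {'N'}: the call to M is filtered out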
13. After crash stack expansion, a large number of suspicious functions remain
How do we rank the suspicious functions?
14. Rank Suspicious Functions
• An empirical study on the characteristics of faulty functions
• Quantify the suspiciousness of each suspicious function
15. Observation 1: Frequent Functions
• Faulty functions appear frequently in the crash traces of the corresponding bucket
For 89-92% of crashing faults, the associated faulty functions appear in all crash execution traces in the corresponding bucket.
• Function Frequency (FF): the more frequent, the more suspicious
[Diagram: crash reports grouped into a crash bucket]
16. Frequent Functions
• Some frequent functions are unlikely to be buggy
Entry points (main, _RtlUserThreadStart, …)
Event handling routines (CloseHandle)
• In information retrieval, some frequent words are uninformative
Stop-words, e.g. “the”, “an”, “a”
Inverse Document Frequency (IDF) down-weights them
• Inverse Bucket Frequency (IBF)
If a function appears in many buckets, it is less likely to be buggy (see the sketch below)
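A TF-IDF-style sketch of the two frequency factors; the exact formulas and normalization used in the paper may differ, and the data shapes (a bucket as a list of crash traces, each trace a set of function names) are assumptions for illustration:

import math

def function_frequency(func, bucket_traces):
    # FF: fraction of the bucket's crash traces that contain the function.
    return sum(1 for trace in bucket_traces if func in trace) / len(bucket_traces)

def inverse_bucket_frequency(func, all_buckets):
    # IBF: IDF-like down-weighting of functions (entry points, event
    # handling routines, ...) that show up in many different buckets.
    containing = sum(1 for traces in all_buckets.values()
                     if any(func in trace for trace in traces))
    return math.log(len(all_buckets) / (1 + containing)) + 1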
17. Observation 2: Functions Close to the Crash Point
• Faulty functions appear closer to the crash point
In Mozilla Firefox, for 84.3% of crashing faults, the distance between the crash point and the associated faulty functions is less than 5.
• Inverse Average Distance to Crash Point (IAD)
18. Observation 3: Less Frequently Changed Functions
• Functions that do not contain crashing faults are often less frequently changed
94.1% of faulty functions have been changed at least once during the past 12 months
Immune functions (Y. Dang et al., ICSE 2012)
• Less frequently changed functions
Functions that have no changes in the past 12 months receive a suspiciousness score of 0
19. Observation 4: Large Functions
• Our prior study (H. Zhang, ICSM 2009) showed that large modules are more likely to be defect-prone
• Function’s Lines of Code (FLOC); see the combined-score sketch below
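The four factors are combined multiplicatively into one suspiciousness score. The sketch below only illustrates that combination and the immune-function rule from Observation 3; the per-factor values are assumed to be precomputed, and the paper's exact per-factor formulas are not reproduced here:

import math

def suspiciousness(ff, ibf, avg_distance_to_crash, loc, changed_in_last_year):
    # Observation 3: functions untouched for 12 months are treated as immune.
    if not changed_in_last_year:
        return 0.0
    iad = 1.0 / (1.0 + avg_distance_to_crash)  # Observation 2: closer to the crash point
    floc = math.log(1 + loc)                   # Observation 4: larger functions
    return ff * ibf * iad * floc               # Observation 1 (FF) and the IDF-like IBF

# e.g. a function seen in every trace of its bucket (ff = 1.0), in few other
# buckets (ibf = 2.3), two calls from the crash point, 120 lines long:
print(suspiciousness(1.0, 2.3, 2.0, 120, True))  # about 3.68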
21. Evaluation Subjects
• Mozilla Products
5 releases of Firefox
2 releases of Thunderbird
1 release of SeaMonkey
• 160 crashing faults (buckets)
• Large-Scale
More than 2 million LOC
More than 120K functions
22. Evaluation Metrics
• Recall@N: percentage of faults successfully located by examining the top N recommended functions
• Mean Reciprocal Rank (MRR)
Measures the quality of ranking results in IR
Value range: 0 to 1
A higher value means a better ranking (see the sketch below)
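For reference, the two metrics under their standard IR definitions (illustrative helpers, not the authors' evaluation scripts); `ranks` holds the 1-based rank of each fault's faulty function, or None if it was not ranked:

def recall_at_n(ranks, n):
    # Fraction of faults whose faulty function appears in the top N.
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

def mean_reciprocal_rank(ranks):
    # Average of 1/rank over all faults; unranked faults contribute 0.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2, 1]
print(recall_at_n(ranks, 1))        # 0.4
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0 + 1/2 + 1) / 5 ≈ 0.57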
23. Experimental Design
• RQ1: How many faults can be successfully located by
CrashLocator?
• RQ2: Can CrashLocator outperform the conventional
stack-only methods?
• RQ3: How does each factor contribute to the crash
localization performance?
• RQ4: How effective is the proposed crash stack
expansion algorithm?
26. RQ2: Comparison with Stack-Only Methods
[Chart: Recall@N (y-axis, 0 to 0.8) against the top N functions examined (x-axis, N = 1, 5, 10, 20, 50, 100) for StackOnlySampling, StackOnlyAverage, StackOnlyChangeDate, and CrashLocator; CrashLocator achieves the highest Recall@N at every N.]
27. RQ3: Contribution of Each Factor
• Inverse Bucket Frequency (IBF)
• Function Frequency (FF)
• Function’s Lines of Code (FLOC)
• Inverse Average Distance to Crash Point (IAD)
31. Conclusions
• Proposed CrashLocator, a novel technique that locates crashing faults based on crash stacks only
• Evaluated on real, large-scale projects
• 50.6%, 63.7%, and 67.5% of crashing faults can be located by examining only the top 1, 5, and 10 functions, respectively
• CrashLocator significantly outperforms stack-only methods, improving MRR by at least 32% and Recall@10 by at least 23%
Editor's Notes
Good afternoon. Thanks for joining this presentation.
My name is …
Today, I am going to present…
This work is a joint work between …
Let me start to introduce it.
As we know, software crashes are common, and a crash is a severe manifestation of faults. Because of the importance and severity of crashes, in recent years industrial companies and open source communities have developed crash reporting systems to collect crash reports from end users.
Due to the large number of users, many crash reports are received daily. It is impossible for developers to inspect each of them.
Therefore, crash reporting systems organize the crash reports. This organizing process, also called crash bucketing, groups together crash reports caused by the same bug. Bug reports are then generated from the crash buckets and sent to developers for debugging.
Although crash reporting systems have proved useful in debugging, debugging crashing faults is still not easy.
After communicating with Mozilla developers, we found that locating crashing faults is sometimes hard, especially when they cannot get direct evidence from the crash stack.
To fix a crashing bug, they usually take an ad hoc approach, typically a top-down inspection of the crash stack.
The crash stack is useful. However, using only the crash stack is insufficient.
We conducted an empirical study on 3 release versions of Firefox.
We found that the buggy code may not always appear in the crash stack.
This is because the buggy code may be executed and then popped off the call stack; its side effects then surface in later executed statements.
In Mozilla, 33%-41% of crashing faults cannot be located in crash stacks.
We then considered fault localization techniques to assist debugging.
In recent years, many spectrum-based fault localization techniques have been proposed, such as Tarantula, Jaccard, and Ochiai.
These techniques contrast passing and failing execution traces, compute suspiciousness scores for program elements, and present a ranked list of program elements to developers.
These techniques are well studied. However, are these techniques directly applicable?
As we know, the passing and failing traces are required by these techniques.
To obtain these traces, the programs usually need to be instrumented.
However, instrumentation is usually not allowed in production software, due to privacy concerns and the performance overhead it causes.
Therefore, we cannot obtain the traces from end users.
For the failing trace, what we have is the crash stack. A crash stack is a snapshot of the call stack at the time of the crash. It is a partial execution trace and is not equivalent to a complete failing trace.
For passing traces, we may be able to obtain them via test cases. However, the study by S. Artzi showed that fault localization techniques are effective only if the passing traces are similar to the failing traces. It is not always possible to have test cases that generate passing traces similar to the failing ones.
First, instrumenting production software to collect full traces is usually not feasible, mainly because of end users' privacy concerns and the performance overhead caused by instrumentation. We noticed that some recent research in our community has proposed low-overhead instrumentation techniques to profile the dynamic behavior of software. However, until these techniques are widely adopted, instrumenting production software to collect full traces remains impractical.
Without instrumentation, are we able to obtain the execution traces?
For the failing traces, since the crash stack captures the call stack at the time of the crash, it is only a partial failing trace.
For the passing traces, although we can get them from existing test cases, we cannot guarantee the effectiveness of these test cases. Some studies show that leveraging passing tests whose characteristics are similar to the failing trace is what achieves effective fault localization.
Due to these limitations, conventional fault localization techniques are not directly applicable in this setting.
Then, with only the crash reports available in the crash reporting system, how can we help developers fix crashing faults?
We propose our research goal: to locate crashing faults based on crash stacks.
Our technique is named CrashLocator. It targets faulty functions because functions are commonly used in unit testing and are helpful for reproducing crashes.
Different from conventional fault localization, our technique does not need any instrumentation.
CrashLocator contains two major steps. The first step is approximating failing traces, the second is ranking suspicious functions.
The first step approximates the failing traces, because faulty functions may not reside on the crash stack. In this step, we use static analysis to generate failing traces based on crash stacks.
The second step ranks the suspicious functions, because the number of suspicious functions after approximation can be very large and we need to prioritize the list. In this step, we do not use passing traces; instead, the ranking is based on characteristics of faulty functions.
Let us see the details of our technique.
To approximate the failing traces, a simple way is to expand the crash stack via call graph information.
For example, we have a crash stack and a call graph at the beginning. Function A has two callees, B and J. B is in the crash stack; J is not, but it could possibly have executed before the crash. Therefore, we include J in the failing trace. Similarly, we do this for functions C and D in the crash stack. As such, we include J, E, M, and N in our failing trace at call depth 1.
We can further expand the failing trace by analyzing the functions that can be called by J, E, M, and N. As such, we include the functions K, L, and F. By expanding the crash stack to different call depths, we can approximate the failing trace.
The basic stack expansion algorithm is simple and conservative. It only uses the function call information in the crash stack.
However, we find that the crash stack contains more information, such as source file positions.
Therefore, we propose an improved stack expansion algorithm based on this information.
To remove functions that cannot be executed before the crash, we conduct control flow analysis on each function in the crash stack.
For example, we first build the CFG of function A.
We find that this position is in the crash stack, so we can infer the possible control flow path, and J is not on that path. We can therefore filter the call to J out; in the stack expansion step, we will not expand the call from A to J.
In our study, we find that the variables at the crash line are usually related to the crash. We then perform backward slicing to get the statements that can affect the crash-related variables.
For example, in function D, line 6 is the crash line. The crash-related variables are s and c. Via backward slicing, we find that line 3 is not in the slice: the call to M does not affect s or c. Therefore, we filter out the call from D to M in our expansion step.
Based on control flow analysis and backward slicing, we can approximate a reasonably precise failing trace.
Let us see the first observation.
A crashing bug may trigger a bucket of crash reports.
The crash stacks in these reports may be different, since a single fault may manifest in different ways under different configurations and platforms.
Intuitively, the faulty functions should appear frequently in the failing traces in these crash reports.
Our empirical study showed that for 89-92% of crashing faults, the associated faulty functions appear in all crash execution traces in the corresponding bucket.
We conclude this result as our first observation. Faulty functions appear frequently in crash traces of the corresponding buckets.
We then propose our first factor, Function Frequency (FF), to characterize faulty functions.
However, some functions appear frequently but are unlikely to be buggy, e.g. the entry points and some event handling routines.
This is similar to the concept of “stop-words” in information retrieval. Words like “a”, “an”, and “the” appear frequently but carry little meaning. Therefore, inverse document frequency is used to decrease the weight of these words.
We adopt a similar concept and define our second factor, Inverse Bucket Frequency (IBF), to decrease the priority of frequent functions that appear across many buckets.
We also find that, in Mozilla, for 84.3% of crashing faults, the distance between the faulty function and the crash point is small.
We summarize this result as our second observation. Based on it, we propose our third factor, Inverse Average Distance to Crash Point (IAD), which gives higher priority to functions closer to the crash point.
Our empirical study also showed that 94.1% of faulty functions have been changed at least once during the past 12 months. This result is consistent with our previous study at Microsoft, in which we found the existence of immune functions: functions considered unlikely to be buggy. One category of immune functions is those that have been successfully used for a long time without changes.
Therefore, we summarize our third observation as Functions that do not contain crashing faults are often less frequently changed.
Using this observation, we select the functions that have no changes in the past 12 months and assign them a suspiciousness score of 0.
In our prior study, we found that large modules are more likely to be buggy. Therefore, we design the fourth factor, Function's Lines of Code (FLOC).
Based on the four factors, we compute the suspiciousness score by multiplying them together.
Based on the suspiciousness score, we rank the functions in the approximated traces.
For the evaluation, we select three Mozilla products as our evaluation subjects.
In total, there are 160 crash buckets. The programming language is C/C++.
All the subjects are large-scale.
We use Recall@N and MRR as evaluation metrics.
Recall@N measures the percentage of bugs that can be located by examining the top N recommended functions.
MRR is a widely used metric for measuring the quality of ranking results in IR. Its value ranges from 0 to 1; a higher MRR means a better ranking result.
We design four research questions.
RQ1 evaluates the performance of our approach.
RQ2 compares our approach with the baseline approaches, the stack-only methods. The stack-only methods originate from the Mozilla developers' feedback.
RQ3 evaluates the contribution of each factor.
RQ4 evaluates the effectiveness of our proposed crash stack expansion algorithm by comparing it with the basic stack expansion algorithm.
The table shows the evaluation on RQ1.
For each product, we showed the metrics of Recall@1 Recall@5 and Recall@10, as well as MRR.
Take Firefox 4.0b4 as an example. Recall@1 is 55.6%, which means that by examining only the top 1 recommended function, we can locate 55.6% of crashing faults. Similarly, by examining the top 5 functions, we can locate 66.7% of faults, and by examining the top 10 functions, we can locate 77.8% of faults. The MRR value is 0.627.
Overall, by examining top 1 functions, we can locate 50.6% of faults.
For RQ2, we compare against the baseline approaches, that is, the stack-only methods.
According to the feedback from Mozilla developers, they usually inspect the functions in the crash stack when debugging. We therefore design three variants of stack-only approaches.
In the StackOnlySampling method, for each bucket, we randomly select one crash from the bucket and rank the functions based on their position in the crash stack.
In the StackOnlyAverage method, for each bucket, we take all the crashes in the bucket and rank the functions based on their average position in the crash stacks.
In the StackOnlyChangeDate method, for each bucket, we randomly select one crash from the bucket and rank the functions based on their last modified date.
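A small sketch of the three stack-only baselines as just described; the data shapes (a bucket as a list of crashes, each crash a list of frame names ordered from the crash point downward, and a last-modified-date map) are assumptions for illustration:

import random

def stack_only_sampling(bucket):
    # Rank by position in one randomly sampled crash stack.
    crash = random.choice(bucket)
    return list(dict.fromkeys(crash))  # de-duplicate, keep stack order

def stack_only_average(bucket):
    # Rank by average stack position over all crashes in the bucket.
    positions = {}
    for crash in bucket:
        for pos, func in enumerate(crash):
            positions.setdefault(func, []).append(pos)
    return sorted(positions, key=lambda f: sum(positions[f]) / len(positions[f]))

def stack_only_change_date(bucket, last_modified):
    # Rank the frames of one sampled crash by last modified date, newest first.
    crash = random.choice(bucket)
    return sorted(dict.fromkeys(crash), key=lambda f: last_modified[f], reverse=True)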
The figure shows the comparison results.
The X axis is the number of functions we examined in the recommendation list.
The Y axis is the Recall@N metric.
As we can see, CrashLocator outperforms all the other approaches.
For example, by examining the top 1 function, CrashLocator can locate 50.6% of faults, while the second best approach, StackOnlyAverage, can only locate 35.6% of faults.
In terms of Recall@1, the improvement of CrashLocator over StackOnlyAverage is 42%.
Similarly, in terms of Recall@10, the improvement ranges from 23.2% to 45.8%.
In RQ3, we evaluate the contribution of the four proposed factors, IBF, FF, FLOC, and IAD.
This figure shows the performance of CrashLocator, in terms of the MRR metric, when incrementally applying the IBF, FF, FLOC, and IAD factors.
When only IBF is applied, the performance is lowest (the overall MRR is about 0.1); incrementally adding the FF and FLOC factors improves the performance.
When all factors are considered, the performance is the best.
Therefore, each factor contributes to the performance, and the IAD factor contributes more significantly than the others.
In RQ4, we evaluate the effectiveness of our proposed stack expansion algorithm by comparing it with the basic one, which only uses the static call graph.
This figure shows the comparison between two stack expansion algorithms in terms of Recall@1, Recall@5, Recall@10, Recall@20, Recall@50 and MRR.
In terms of Recall@N, the improvement of the proposed expansion algorithm over the basic one ranges from 13.3% to 72.3%.
In terms of MRR, the improvement is 59.3%.