CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)


ISSTA 2014 Presentation.
A winner of the ACM SIGSOFT Distinguished Paper Award.

  • Good afternoon. Thanks for joining this presentation.
    My name is …
    Today, I am going to present …
    This is joint work with …

    Let me start by introducing it.
  • As we know, software crashes are common, and a crash is a severe manifestation of faults. Because of the importance and severity of crashes, in recent years industrial companies and open-source communities have developed crash reporting systems to collect crash reports from end users.
    Due to the large number of users, many crash reports are received daily, and it is impossible for developers to inspect each of them.
    Therefore, crash reporting systems organize the crash reports. This organizing process, also called crash bucketing, groups together the crash reports caused by the same bug. Bug reports are then generated from the crash buckets and sent to developers for debugging.
  • Although crash reporting systems have proven useful, debugging crashing faults is still not easy.
    After communicating with Mozilla developers, we found that locating a crashing fault can be hard, especially when developers cannot get direct evidence from the crash stack.

    To fix a crashing bug, they usually use an ad hoc approach: typically a top-down inspection of the crash stack.

    The crash stack is useful; however, using only the crash stack is insufficient.
  • We conducted an empirical study on three release versions of Firefox.
    We found that the buggy code does not always appear in the crash stack.
    This is because the buggy code may be executed and popped off the call stack, so its side effects only manifest in statements executed later.

    In Mozilla, 33%-41% of crashing faults cannot be located in crash stacks.
  • We then considered fault localization techniques to assist debugging.
    In recent years, many spectrum-based fault localization techniques have been proposed, such as Tarantula, Jaccard, and Ochiai.
    These techniques contrast passing and failing execution traces, compute suspiciousness scores for program elements, and present a ranked list of program elements to developers.
  • These techniques are well studied. However, are they directly applicable here?

    As we know, these techniques require passing and failing traces, which are usually collected by instrumenting the program.
    However, instrumenting production software is usually not feasible, mainly because of end users' privacy concerns and the performance overhead the instrumentation introduces. We note that some recent research in our community has proposed low-overhead instrumentation technologies for profiling the dynamic behavior of software; still, until these techniques are widely adopted, collecting full traces from production software remains impractical. Therefore, we cannot obtain the traces from end users.

    For the failing trace, what we have is the crash stack. The crash stack is a snapshot of the call stack at the time of the crash; it is a partial execution trace and is not equivalent to a complete failing trace.

    For passing traces, we may be able to obtain them from existing test cases. However, the study by S. Artzi et al. showed that fault localization techniques are effective when the passing traces are similar to the failing traces, and it is not always possible to have test cases that generate such passing traces.

    Due to these limitations, conventional fault localization techniques are not directly applicable in this setting.

  • Then, with only crash reports available in the crash reporting system, how can we help developers fix crashing faults?

    We propose our research goal: to locate crashing faults based on crash stacks.
  • Our technique is named CrashLocator. It aims at locating faulty functions, because functions are the common unit in unit testing and are helpful for crash reproduction.

    Unlike conventional fault localization, our technique does not need any instrumentation.

    CrashLocator contains two major steps: approximating failing traces, and ranking suspicious functions.

    The first step approximates the failing traces, because faulty functions may not reside on the crash stack. In this step, we use static analysis to generate the failing traces based on crash stacks.

    The second step ranks the suspicious functions, because the number of suspicious functions after approximation can be very large and we need to prioritize the list. In this step, we do not use passing traces; instead, the ranking is based on the characteristics of faulty functions.
  • Let us look at the details of our technique.
    To approximate the failing traces, a simple way is to expand the crash stack using call graph information.
    For example, suppose we start with this crash stack and call graph. Function A has two callees, B and J. B is in the crash stack; J is not, but it could have been executed before the crash, so we include J in the failing trace. We do the same for functions C and D in the crash stack. In this way, we include J, E, M, and N in the failing trace at call depth 1.

    We can further expand the failing trace by analyzing the functions that can be called by J, E, M, and N, which adds K, L, and F. By expanding the crash stack to different call depths, we approximate the failing traces.
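The basic expansion above can be sketched as a breadth-first walk over the static call graph. This is our own illustrative sketch, not the paper's implementation; the crash stack, call graph, and function names mirror the hypothetical example on the slide.

```python
def expand_stack(crash_stack, call_graph, max_depth):
    """Approximate a failing trace by expanding the crash stack over a
    static call graph, one depth level at a time.
    crash_stack: list of function names (crash point first).
    call_graph: maps each function to the functions it may call."""
    trace = set(crash_stack)          # depth 0: the crash stack itself
    frontier = list(crash_stack)
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for fn in frontier:
            for callee in call_graph.get(fn, []):
                if callee not in trace:   # possibly executed before the crash
                    trace.add(callee)
                    next_frontier.append(callee)
        frontier = next_frontier
    return trace

# Example mirroring the slide: crash stack D-C-B-A;
# depth 1 adds J, E, M, N; depth 2 would add K, L, F.
call_graph = {"A": ["B", "J"], "B": ["C"], "C": ["D", "E"],
              "D": ["M", "N"], "J": ["K", "L"], "N": ["F"]}
print(sorted(expand_stack(["D", "C", "B", "A"], call_graph, 1)))
```

Increasing `max_depth` grows the approximated trace, which is exactly why the ranking step afterwards is needed.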

  • The basic stack expansion algorithm is simple and conservative: it only uses the function call information of the functions in the crash stack.

    However, we find that the crash stack contains more information, such as source file position information.
    Therefore, we propose an improved stack expansion algorithm based on this information.
  • To prune functions that cannot be executed before the crash, we conduct control flow analysis on each function in the crash stack.
    For example, we first build the CFG of function A.
    We find that this position is in the crash stack, so we can infer the possible control flow path, and J is not on that path. We can therefore filter out the call to J: in the stack expansion step, we will not expand the call from A to J.
  • In our study, we find that the variables on the crash line are usually related to the crash. We therefore perform backward slicing to collect the statements that can affect these crash-related variables.

    For example, in function D, Line 6 is the crash line, and the crash-related variables are s and c. Via backward slicing, we find that Line 3 is not among the sliced statements: the call to M cannot affect s or c. Therefore, we filter out the call from D to M in our expansion step.

    Based on control flow analysis and backward slicing, we can approximate a reasonably precise failing trace.
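The slicing-based filter can be sketched with a toy statement representation. This is our own simplification, not the paper's analysis: each statement records the variable it defines, the variables it uses, and any callee, and we walk the function body backwards from the crash-related variables.

```python
def slice_callees(statements, crash_vars):
    """Return the callees that can affect the crash-related variables.
    statements: list of (line, defined_var, used_vars, callee) tuples
    in program order; callee is None for statements without a call."""
    relevant = set(crash_vars)
    kept = set()
    for line, defined, used, callee in reversed(statements):
        if defined in relevant:       # this statement is in the backward slice
            relevant |= set(used)     # its inputs become relevant too
            if callee:
                kept.add(callee)
    return kept

# Example mirroring function D on the slide: crash at line 6 on {s, c};
# a = M() at line 3 never flows into s or c, so M is filtered out.
body = [
    (2, "s", [], None),
    (3, "a", [], "M"),        # a = M()
    (4, "b", [], None),
    (5, "c", ["b"], "N"),     # c = N(b)
    (6, "s", ["c"], None),    # s = c[1]  (crash here)
]
print(slice_callees(body, {"s", "c"}))
```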
  • Let us look at the first observation.
    A crashing bug may trigger a bucket of crash reports.
    The crash stacks in these reports may differ, since a single fault may manifest in different ways under different configurations and platforms.
    Intuitively, the faulty function should appear frequently in the failing traces of these crash reports.
    Our empirical study showed that for 89-92% of crashing faults, the associated faulty functions appear in all crash execution traces in the corresponding bucket.
    We summarize this result as our first observation: faulty functions appear frequently in the crash traces of the corresponding buckets.
    We then propose our first factor, Function Frequency (FF), to characterize faulty functions.
  • However, some functions appear frequently but are unlikely to be buggy, e.g. entry points and event handling routines.

    This is similar to the concept of "stop words" in information retrieval: words like "a", "an", and "the" appear frequently but carry little meaning, so inverse document frequency is used to decrease their weight.

    We adopt a similar concept and define our second factor, Inverse Bucket Frequency (IBF), to decrease the priority of functions that appear frequently across many buckets.
  • We also find that, in Mozilla, for 84.3% of crashing faults, the distance between the faulty function and the crash point is very small.

    We summarize this result as our second observation. Based on it, we propose our third factor, Inverse Average Distance to Crash Point (IAD), which gives higher priority to functions closer to the crash point.
  • Our empirical study also showed that 94.1% of faulty functions had been changed at least once during the past 12 months. This result is consistent with our previous study at Microsoft, in which we found the existence of immune functions: functions that are considered unlikely to be buggy. One category of immune functions is those that have been used successfully for a long time without changes.
    We therefore summarize our third observation: functions that do not contain crashing faults are often less frequently changed.

    Using this observation, we select the functions that have had no changes in the past 12 months and assign them a suspiciousness score of 0.
  • Our prior study found that large modules are more likely to be defect-prone. Therefore, we design the fourth factor, Function's Lines of Code (FLOC).
  • Based on the four factors, we define the suspiciousness score as the product of all the factors.

    Based on this score, we rank the functions in the approximated traces.
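Following the formulas on the score slide — FF(f,B) = N_{f,B}/N_B, IBF(f) = log(#B/#B_f + 1), IAD(f,B) = N_{f,B}/(1 + Σ dis_j(f)), FLOC(f) = log(LOC(f) + 1) — the combined score can be sketched as below. The parameter names are our own, and the natural logarithm is an assumption.

```python
import math

def suspicious_score(n_fB, n_B, total_buckets, buckets_with_f,
                     distances, loc, changed_recently=True):
    """Sketch of the suspiciousness score Score(f, B).
    n_fB: crash traces in bucket B containing function f
    n_B: crash traces in bucket B
    total_buckets / buckets_with_f: counts for the IBF factor
    distances: f's distance to the crash point in each containing trace
    loc: f's lines of code"""
    if not changed_recently:          # Observation 3: unchanged functions score 0
        return 0.0
    ff = n_fB / n_B                                         # Function Frequency
    ibf = math.log(total_buckets / buckets_with_f + 1)      # Inverse Bucket Frequency
    iad = n_fB / (1 + sum(distances))                       # Inverse Average Distance
    floc = math.log(loc + 1)                                # Function Lines of Code
    return ff * ibf * iad * floc
```

A function that appears in every crash of its bucket, sits close to the crash point, and shows up in few other buckets scores high; a function untouched for 12 months is filtered out entirely.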
  • For the evaluation, we select three Mozilla products as our evaluation subjects.
    In total, there are 160 crashing faults (buckets). The programming language is C/C++.
    All the subjects are large-scale.
  • We use Recall@N and MRR as evaluation metrics.
    Recall@N measures the percentage of faults that can be located by examining the top N recommended functions.

    MRR (Mean Reciprocal Rank) is a widely used metric for measuring the quality of ranking results in IR. Its value ranges from 0 to 1; a higher MRR means a better ranking.
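Both metrics are simple to compute. A minimal sketch, with hypothetical ranks (the 1-based position of the faulty function in each fault's recommendation list, or None if it was never recommended):

```python
def recall_at_n(ranks, n):
    """Fraction of faults whose faulty function is in the top n
    recommendations; ranks holds the 1-based rank per fault
    (None if the faulty function was not recommended)."""
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return hits / len(ranks)

def mrr(ranks):
    """Mean Reciprocal Rank over the same list; misses contribute 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 3, None, 2, 1]    # hypothetical ranks for five crashing faults
print(recall_at_n(ranks, 1))  # 0.4
print(mrr(ranks))
```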
  • We design four research questions.
    RQ1 evaluates the performance of our approach.
    RQ2 compares our approach with the baseline approaches, the stack-only methods, which are derived from Mozilla developers' feedback.
    RQ3 evaluates the contribution of each factor.
    RQ4 evaluates the effectiveness of our proposed crash stack expansion algorithm by comparing it with the basic stack expansion algorithm.
  • The table shows the evaluation results for RQ1.
    For each product, we report Recall@1, Recall@5, and Recall@10, as well as MRR.
    Take Firefox 4.0b4 as an example: Recall@1 is 55.6%, which means that by examining only the top 1 recommended function, we can locate 55.6% of crashing faults. Similarly, by examining the top 5 functions we can locate 66.7% of faults, and by examining the top 10 functions, 77.8%. The MRR value is 0.627.

    Overall, by examining only the top 1 function, we can locate 50.6% of faults.
  • For RQ2, we compare with the baseline approaches, the stack-only methods.

    According to the feedback from Mozilla developers, they usually inspect the functions in the crash stack for debugging. We therefore design three variants of stack-only approaches.

    StackOnlySampling randomly selects one crash from each bucket and ranks the functions by their position in the crash stack.

    StackOnlyAverage uses all the crashes in each bucket and ranks the functions by their average position in the crash stacks.

    StackOnlyChangeDate randomly selects one crash from each bucket and ranks the functions by their last modified date.

  • The figure shows the comparison results.
    The X axis is the number of functions examined in the recommendation list.
    The Y axis is the Recall@N metric.

    As we can see, CrashLocator outperforms all the other approaches.
    For example, by examining the top 1 function, CrashLocator can locate 50.6% of faults, while the second-best approach, StackOnlyAverage, can only locate 35.6%.
    In terms of Recall@1, the improvement of CrashLocator over StackOnlyAverage is 42%.
    In terms of Recall@10, the improvement over the stack-only methods ranges from 23.2% to 45.8%.
  • In RQ3, we evaluate the contributions of the four proposed factors: IBF, FF, FLOC, and IAD.
  • This figure shows the performance of CrashLocator, in terms of MRR, when incrementally applying the IBF, FF, FLOC, and IAD factors.
    When only IBF is applied, the performance is lowest (the overall MRR is about 0.1); incrementally adding the FF and FLOC factors improves the performance.
    When all factors are considered, the performance is best.

    From this we can see that each factor contributes to the performance, and the IAD factor contributes more significantly than the others.

  • In RQ4, we evaluate the effectiveness of our proposed stack expansion algorithm by comparing it with the basic one, which only uses the static call graph.
  • This figure shows the comparison between the two stack expansion algorithms in terms of Recall@1, Recall@5, Recall@10, Recall@20, Recall@50, and MRR.
    In terms of Recall@N, the improvement of the proposed expansion algorithm over the basic one ranges from 13.3% to 72.3%.
    In terms of MRR, the improvement is 59.3%.
  • CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)

    1. CrashLocator: Locating Crashing Faults Based on Crash Stacks. Rongxin Wu1, Hongyu Zhang2, Shing-Chi Cheung1 and Sunghun Kim1. 1The Hong Kong University of Science and Technology, 2Microsoft Research. July 24th, 2014, ISSTA 2014
    2. Background. [diagram: Software Crash → Crash Information with Crash Stack → Crash Reporting System → Crash Buckets → Bug Reports → Developers]
    3. Feedback From Mozilla Developers • Locating crashing faults is hard • Ad hoc approach. "… and look at the crash stack listed. It shows the line number of the code, and then I go to the code and inspect it. If I am unsure what it does I go to the second line of the stack and code and inspect that, and so on and so forth …" "Some crashes are hard to fix because it is not necessarily indicative of the place where it crashes in the crash stack …" "I use the top down method of following the crash backwards." "Sometimes it can be very difficult."
    4. Uncertain Fault Location • The faulty function may not appear in crash stack. About 33%~41% of crashing faults in Firefox cannot be located in crash stacks! [diagram: a crash stack (A, B, C, D) with its crash point, and the buggy code outside the stack]
    5. Spectrum-Based Fault Localization • Related Work: Tarantula (J. A. Jones et al., ICSE 2002; J. A. Jones et al., ASE 2005), Jaccard (R. Abreu et al., TAICPART-MUTATION 2007), Ochiai (R. Abreu et al., TAICPART-MUTATION 2007; S. Artzi et al., ISSTA 2010), … • Passing Traces and Failing Traces
    6. Spectrum-Based Fault Localization • Are these techniques applicable? Instrumented product software → failing traces and passing traces. Privacy concern; performance overhead (C. Luk et al., PLDI 2005). Crash stack: f1 f2 f3 … fn. Test cases: effectiveness (S. Artzi et al., ISSTA '10)
    7. Our Research Goal. How to help developers fix crashing faults? – Locate crashing faults based on crash stack
    8. Our Technique: CrashLocator • Target at locating faulty functions • No instrumentation needed • Approximate failing traces: based on crash stacks; use static analysis techniques • Rank suspicious functions: without passing traces; based on characteristics of faulty functions
    9. Approximate Failing Traces • Basic Stack Expansion Algorithm. [diagram: crash stack A, B, C, D and call graph; depth-1 adds E, J, M, N; depth-2 adds F, K, L; depth-3 adds G, H]
    10. Approximate Failing Traces • Basic Stack Expansion Algorithm: function call information only • Improved Stack Expansion Algorithm: source file position information. [crash stack table — function/position, file, line: D/0, file_0, l0; C/1, file_1, l1; B/2, file_2, l2; A/3, file_3, l3]
    11. Improved Stack Expansion Algorithm • Control Flow Analysis. [diagram: CFG of A from Entry to Exit, with one branch calling J() and the in-crash-stack branch calling B()]
    12. Improved Stack Expansion Algorithm • Backward Slicing. Function D:
        1. Obj D(){
        2.   Obj s;
        3.   int a = M();
        4.   char b = '';
        5.   Obj[] c = N(b);
        6.   s = c[1]; // crash here
        7.   if (s != '') {
        8.     …
        9.   }
        }
        Crash-related variables: {s, c}. The call to M at line 3 is not in the slice, so M is filtered out of the expansion.
    13. After crash stack expansion, there are still a large number of suspicious functions. How to rank the suspicious functions?
    14. Rank Suspicious Functions • An empirical study on the characteristics of faulty functions • Quantify the suspiciousness of suspicious functions
    15. Observation 1: Frequent Function • Faulty functions appear frequently in the crash traces of the corresponding buckets. For 89-92% of crashing faults, the associated faulty functions appear in all crash execution traces in the corresponding bucket. More frequent, more suspicious → Function Frequency (FF)
    16. Frequent Function • Some frequent functions are unlikely to be buggy: entry points (main, _RtlUserThreadStart, …), event handling routines (CloseHandle) • In information retrieval, some frequent words are useless: stop words, e.g. "the", "an", "a" → Inverse Document Frequency (IDF) • Inverse Bucket Frequency (IBF): if a function appears in many buckets, it is less likely to be buggy
    17. Observation 2: Functions Close to Crash Point • Faulty functions appear closer to the crash point: in Mozilla Firefox, for 84.3% of crashing faults, the distance between the crash point and the associated faulty functions is less than 5 • Inverse Average Distance to Crash Point (IAD)
    18. Observation 3: Less Frequently Changed Functions • Functions that do not contain crashing faults are often less frequently changed: 94.1% of faulty functions have been changed at least once during the past 12 months; Immune Functions (Y. Dang et al., ICSE 2012) • Less frequently changed functions: functions that have no changes in the past 12 months get a suspicious score of 0
    19. Observation 4: Large Functions • Our prior study (H. Zhang, ICSM 2009) showed that large modules are more likely to be defect-prone • Function's Lines of Code (FLOC)
    20. Suspicious Score. Score(f, B) = FF(f, B) * IBF(f) * IAD(f, B) * FLOC(f)
        • FF (Function Frequency): FF(f, B) = N_{f,B} / N_B
        • IBF (Inverse Bucket Frequency): IBF(f) = log(#B / #B_f + 1)
        • IAD (Inverse Average Distance to Crash Point): IAD(f, B) = N_{f,B} / (1 + Σ_{j=1..n} dis_j(f))
        • FLOC (Function Lines of Code): FLOC(f) = log(LOC(f) + 1)
    21. Evaluation Subjects • Mozilla products: 5 releases of Firefox, 2 releases of Thunderbird, 1 release of SeaMonkey • 160 crashing faults (buckets) • Large-scale: more than 2 million LOC, more than 120K functions
    22. Evaluation Metrics • Recall@N: percentage of successfully located faults by examining top N recommended functions • Mean Reciprocal Rank (MRR): measures the quality of the ranking results in IR; value range 0 ~ 1; higher value means better ranking
    23. Experimental Design • RQ1: How many faults can be successfully located by CrashLocator? • RQ2: Can CrashLocator outperform the conventional stack-only methods? • RQ3: How does each factor contribute to the crash localization performance? • RQ4: How effective is the proposed crash stack expansion algorithm?
    24. RQ1: CrashLocator Performance
        System           | Recall@1 | Recall@5 | Recall@10 | MRR
        Firefox 4.0b4    | 55.6%    | 66.7%    | 77.8%     | 0.627
        Firefox 4.0b5    | 47.1%    | 70.6%    | 70.6%     | 0.566
        Firefox 4.0b6    | 48.0%    | 64.0%    | 64.0%     | 0.540
        Firefox 14.0.1   | 52.0%    | 52.0%    | 56.0%     | 0.528
        Firefox 16.0.1   | 53.8%    | 53.8%    | 53.8%     | 0.542
        Thunderbird 17.0 | 48.5%    | 66.7%    | 78.8%     | 0.568
        Thunderbird 24.0 | 50.0%    | 66.7%    | 66.7%     | 0.544
        SeaMonkey 2.21   | 55.0%    | 70.0%    | 70.0%     | 0.600
        Summary          | 50.6%    | 63.7%    | 67.5%     | 0.559
    25. RQ2: Comparison with Stack-Only Methods • Conventional stack-only methods: StackOnlySampling, StackOnlyAverage, StackOnlyChangeDate
    26. RQ2: Comparison with Stack-Only Methods. [figure: Recall@N vs. top N functions (N = 1, 5, 10, 20, 50, 100) for StackOnlySampling, StackOnlyAverage, StackOnlyChangeDate, and CrashLocator]
    27. RQ3: Contribution of Each Factor • Inverse Bucket Frequency (IBF) • Function Frequency (FF) • Function's Lines of Code (FLOC) • Inverse Average Distance to Crash Point (IAD)
    28. RQ3: Contribution of Each Factor. [figure: MRR per subject (ff4.0b4 … sm2.21, Summary) for IBF, IBF*FF, IBF*FF*FLOC, and IBF*FF*FLOC*IAD]
    29. RQ4: Stack Expansion Algorithms • Basic stack expansion algorithm: static call graph • Improved stack expansion algorithm: static call graph, control flow analysis, backward slicing
    30. RQ4: Stack Expansion Algorithms. [figure: Recall@1, Recall@5, Recall@10, Recall@20, Recall@50, and MRR for basic vs. improved stack trace expansion]
    31. Conclusions • Propose a novel technique, CrashLocator, to locate crashing faults based on crash stack only • Evaluate on real and large-scale projects • 50.6%, 63.7%, and 67.5% of crashing faults can be located by examining only the top 1, 5, and 10 functions • CrashLocator outperforms stack-only methods significantly, with the improvement of MRR at least 32% and the improvement of Recall@10 at least 23%