Scrutineer: Integrating Fault Localisation with Property-Based Testing
Unlike most of the tools I’m writing about here, Scrutineer does not yet exist. However, I find the idea promising enough that an investigation of why it didn’t work would also be compelling.
Having shipped “explain mode” in Hypothesis, I’ve put this idea down again as there’s no clear way forward to a good user experience. See also [PO11]. From 2021, UniVal [KHP21] is at least interestingly causal…
Software testing tools aim to help people find bugs, and - having found them - to understand and then fix them. Property-based testing traditionally approaches these goals by generating random inputs to a user-provided test function, and shrinking any failures to a minimal reproducing example. Simple examples are a valuable aid for debugging, but the emphasis remains on bug discovery.
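To make the generate-and-shrink loop concrete, here’s a toy sketch in plain Python - not Hypothesis itself, which does all of this far more cleverly. The buggy function and the deletion-only shrinker are hypothetical illustrations: a sort that drops duplicates fails a length-preservation property, and greedy deletion shrinks any failing list down to a minimal duplicate pair.

```python
import random

def broken_sort(xs):
    # Hypothetical buggy function: loses duplicates by round-tripping via a set.
    return sorted(set(xs))

def prop_preserves_length(xs):
    # Property under test: sorting should not change the length.
    return len(broken_sort(xs)) == len(xs)

def find_and_shrink(prop, tries=500, seed=0):
    """Generate random integer lists; on failure, greedily shrink by
    deleting elements while the property still fails."""
    rng = random.Random(seed)
    for _ in range(tries):
        xs = [rng.randrange(-10, 10) for _ in range(rng.randrange(0, 10))]
        if prop(xs):
            continue
        shrinking = True
        while shrinking:
            shrinking = False
            for i in range(len(xs)):
                candidate = xs[:i] + xs[i + 1:]
                if not prop(candidate):
                    xs, shrinking = candidate, True
                    break
        return xs
    return None

minimal = find_and_shrink(prop_preserves_length)
# Shrinking by deletion always terminates at a two-element duplicate pair here.
```

Real shrinkers also simplify element values and exploit structure in the generation process; deletion alone is enough to show why minimal examples help debugging.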
Fault localisation (FL) tools recommend where to look for the causes of a known bug, based on the difference in program behaviour between passing and failing tests. Previous work shares three fundamental challenges which are mitigated by combining fault localisation with property-based testing rather than a hand-written test suite:
We ensure that passing and failing program behaviours are related by comparing test inputs rather than test functions, localising the cause of a particular test failure rather than a speculated bug in the program. This returns responsibility for interpretation to the user, is more precise, and admits the possibility of fixing the test harness rather than the program.
If there are several distinct bugs in the program, instead of conflating their causes we localise each separately using Hypothesis’ usual distinction by exception type and location. This heuristic works well in practice, and when it draws too-fine distinctions fault localisation can identify a shared cause for “multiple” failures.
We argue that integration with property-based testing and opportunistic reporting provide a substantially better user experience than previous work focussed on standalone fault-localisation tools, and that these benefits further our ultimate goals of helping people to find, understand, and fix more bugs.
We further argue that the statistical tools of causal inference provide a sound basis for analysis and comparison of fault localisation tools.
This section essentially expands on the abstract.
intro to PBT and Hypothesis, generalising/abstracting failures
intro to FL and prior art
motivation / problem statement
what do users expect from FL? [KXLL16] is a nice user-survey paper; I find the question framing as useful as the results themselves (which I take as indicative rather than definitive).
AURORA [BSchlogelA+20] is pretty close to my goal for Scrutineer; it uses AFL to explore a crashing region where we use (and integrate with) Hypothesis. I doubt that the sheer volume of predicates it uses will be practical in higher-level contexts than native binary executables, though.
[Zel02] is also very useful - real causal information by intervening on program states, and chaining that backwards to get real explanations in terms of a sequence of necessary and sufficient causes.
ALHAZEN, in [KHSZ20] (also featuring Zeller), uses a grammar to learn decision-tree features which predict failure. This looks to me like correlation-based generalisation rather than something properly causal, but the techniques involved in dynamic exploration could be useful.
Reduce Before You Localize: Delta-Debugging and Spectrum-Based Fault Localization [COAG18] finds that shrinking to a minimal example before running fault localization helps - more than switching between the worst and best ranking formulae. It’s unclear how much this would help given my set-based approach to control-flow divergence (below), but it’s definitely relevant prior art given that Scrutineer traces while shrinking!
(more recently than the rest of these notes), UniVal [KHP21] improves FL by integrating value- and predicate-based causal inference. First, execution traces are recorded in a single-static-assignment form which captures both control-flow and dataflow dependencies. Second, a counterfactual random forest model is fitted for each candidate fault location (i.e. treatment variable), which are then scored by their maximum estimated contribution to causing test failure.
UniVal is particularly nice as a demonstration of how to avoid confounding, though I remain somewhat sceptical about usefulness in large codebases and whether presenting dataflow analysis is net good for users. The evaluation is nice enough that I trust their results; I’m just concerned about external validity for deployment. The related work section is also great, and includes a few relevant papers which should be added here.
See also the preceding paper, “CounterFault: Value-Based Fault Localization by Modeling and Predicting Counterfactual Outcomes”.
note that other recent work has investigated generalisation/abstraction, but it’s not seen anything like the success of FL. See https://github.com/HypothesisWorks/hypothesis/issues/2192
The problem is that we have some program, and can generate a large number of traces (paths through the control-flow graph, which is directed but may have cycles) with their labels (OK, failed with error1, failed with error2, …). It’s easy to tell that each trace is sufficient to trigger whichever bug; but the goal is to determine which part of the trace is necessary so that the bug can be more easily fixed.
In this framing I ignore program state, treating it as a latent variable that may change and/or be observed at each node. This gives us a tractable probabilistic problem rather than an intractably large deterministic problem.
Because traces can be very long, we usually summarise them as the number of times each edge in the graph was visited, or even just the set of edges visited at least once. This is both useful for performance, and allows us to directly report to the user which lines are involved without complicated qualifiers about prior execution paths.
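Both summaries fall out of pairing adjacent nodes in a trace. A small sketch, with made-up node names:

```python
from collections import Counter

# A trace is a path through the control-flow graph, written as a node sequence.
trace = ["start", "check", "loop", "body", "loop", "body", "loop", "exit"]

def edge_counts(trace):
    """Summarise a trace as the number of times each edge was traversed."""
    return Counter(zip(trace, trace[1:]))

def edge_set(trace):
    """Coarser summary: the set of edges traversed at least once."""
    return set(zip(trace, trace[1:]))

# edge_counts remembers that ("loop", "body") ran twice;
# edge_set forgets multiplicity but is cheaper to store and compare.
```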
Precise problem framing is really important - I suspect that Judea Pearl’s work on causal inference provides a good theoretical grounding to understand both when and why fault localisation works. Prior work in this space has criticised correlation-only tools, but not explicitly drawn from the “ladder of causation”.
https://github.com/hchasestevens/fault-localization is a current tool for Python - simply comparing the number of times each line of code was executed by passing and failing tests, and flagging those disproportionately executed by failing tests as suspicious. This is the most basic approach in the literature; while it sometimes works, nobody finds it reliable.
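For reference, this family of spectrum-based techniques amounts to something like the sketch below, here using the well-known Tarantula suspiciousness formula as a representative scoring rule (the linked tool uses its own count comparison; the data and helper names here are hypothetical):

```python
def tarantula(line_hits, labels):
    """Classic spectrum-based suspiciousness (Tarantula formula).

    line_hits: list of sets - the lines executed by each test run.
    labels: parallel list of booleans - True if that run failed.
    """
    total_fail = sum(labels)
    total_pass = len(labels) - total_fail
    scores = {}
    for line in set().union(*line_hits):
        fail = sum(1 for hits, bad in zip(line_hits, labels) if bad and line in hits)
        passed = sum(1 for hits, bad in zip(line_hits, labels) if not bad and line in hits)
        fail_rate = fail / total_fail if total_fail else 0.0
        pass_rate = passed / total_pass if total_pass else 0.0
        denom = fail_rate + pass_rate
        scores[line] = fail_rate / denom if denom else 0.0
    return scores

# Lines 1-2 run in every test; line 3 runs only in the failing tests.
runs = [{1, 2}, {1, 2}, {1, 2, 3}, {1, 2, 3}]
labels = [False, False, True, True]
scores = tarantula(runs, labels)
# Line 3 scores 1.0 (maximally suspicious); lines 1-2 score 0.5.
```

The scores are a pure function of co-occurrence counts - which is exactly the correlation-only limitation criticised below.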
I claim that this is precisely because such tools are providing a correlation, but users want a causal finding: “changing this line would make test failure less likely”. Unfortunately this is a counterfactual claim, which we do not have grounds to make without interventional study - and our interventions are on inputs, not the code which processes them.
A very simple approach is to consider the set of lines which are always run for failing inputs, never run for passing inputs, and then select only those which can be reached from the start of the control-flow graph without passing through another such line. I have most of a prototype for this approach; the main downside is that it often reports an empty set where more nuanced approaches need not.
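A rough sketch of this set-based approach (the CFG, node names, and traces are all hypothetical; this assumes the start node is not itself a candidate, since it runs in every trace):

```python
from collections import deque

def localise(cfg, start, failing, passing):
    """Lines in every failing trace and no passing trace, keeping only those
    reachable from `start` without crossing another candidate line."""
    candidates = set.intersection(*(set(t) for t in failing))
    for t in passing:
        candidates -= set(t)
    keep, seen, frontier = set(), {start}, deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in cfg.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in candidates:
                keep.add(nxt)        # record, but do not expand past it
            else:
                frontier.append(nxt)
    return keep

# "d" is only reachable through "b", so only "b" is reported.
cfg = {"start": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["e"]}
failing = [["start", "a", "b", "d", "e"]]
passing = [["start", "a", "c", "e"]]
```

If every failing-only line is sometimes skipped by other failing traces, `candidates` is empty - the empty-report downside noted above.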
Naively you could compare the probabilities of visiting each edge in the graph conditional on the trace label, and then guess that the divergences closest to the start node are the source of the bug… but the traces for each label are drawn from different distributions, so the naive approach would yield many false-positives.
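The naive comparison might look like the following sketch (traces hypothetical). Note how the downstream edges ("b", "e") and ("c", "e") diverge just as strongly as the edges at the actual branch point "a" - exactly the false-positive problem:

```python
def edge_visit_probs(traces):
    """Estimate P(edge visited at least once) from a sample of traces."""
    n = len(traces)
    counts = {}
    for t in traces:
        for edge in set(zip(t, t[1:])):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n for edge, c in counts.items()}

passing = [["s", "a", "c", "e"], ["s", "a", "c", "e"]]
failing = [["s", "a", "b", "e"], ["s", "a", "b", "e"]]
p_pass, p_fail = edge_visit_probs(passing), edge_visit_probs(failing)

# Flag edges whose visit probability diverges between the two labels.
diverging = {e for e in p_pass.keys() | p_fail.keys()
             if abs(p_pass.get(e, 0.0) - p_fail.get(e, 0.0)) > 0.5}
# ("s", "a") is shared by both labels and is not flagged; everything
# downstream of the divergence at "a" is flagged, causal or not.
```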
A related paper on the ‘inflection point hypothesis’ [ZRL+19] suggests that we model program execution as a linear sequence of instructions; and then by comparing our failing trace to the passing trace with the longest common prefix, that point of divergence is the root cause of failure. False-positives are obviously possible; this approach relies on successfully searching for very similar passing traces. Note that the main difference from the above is that ‘inflection points’ rely on non-aggregated execution paths.
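A minimal sketch of the longest-common-prefix comparison, on hypothetical non-aggregated traces:

```python
def inflection_point(failing_trace, passing_traces):
    """Compare the failing trace against the passing trace sharing the longest
    common prefix; the first node after that prefix is the suspected root cause."""
    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    best = max(passing_traces, key=lambda t: common_prefix_len(failing_trace, t))
    k = common_prefix_len(failing_trace, best)
    return failing_trace[k] if k < len(failing_trace) else None

failing = ["s", "a", "b", "crash"]
passing = [["s", "x", "y", "ok"], ["s", "a", "c", "ok"]]
# The closest passing trace shares the prefix ["s", "a"], so "b" is blamed.
```

The quality of the answer clearly depends on having a passing trace that tracks the failing one closely - hence the emphasis on searching for very similar passing executions.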
Tim Blazytko, Moritz Schlögel, Cornelius Aschermann, Ali Abbasi, Joel Frank, Simon Wörner, and Thorsten Holz. AURORA: statistical crash analysis for automated root cause explanation. In 29th USENIX Security Symposium (USENIX Security 20), 235–252. USENIX Association, 2020. URL: https://www.usenix.org/conference/usenixsecurity20/presentation/blazytko.
Arpit Christi, Matthew Lyle Olson, Mohammad Amin Alipour, and Alex Groce. Reduce before you localize: delta-debugging and spectrum-based fault localization. In 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), 184–191. 2018. URL: https://web.engr.oregonstate.edu/~olsomatt/issre_w18.pdf, doi:10.1109/ISSREW.2018.00005.
Alexander Kampmann, Nikolas Havrikov, Ezekiel O. Soremekun, and Andreas Zeller. When does my program do this? learning circumstances of software behavior. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, 1228–1239. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://publications.cispa.saarland/3107/, doi:10.1145/3368089.3409687.
Pavneet Singh Kochhar, Xin Xia, David Lo, and Shanping Li. Practitioners' expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, July 2016. URL: https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=4576&context=sis_research, doi:10.1145/2931037.2931051.
Yigit Kucuk, Tim A. D. Henderson, and Andy Podgurski. Improving fault localization by integrating value and predicate based causal inference techniques. 2021. arXiv:2102.06292.
Tien-Duy B. Le and David Lo. Will fault localization work for these failures? an automated approach to predict effectiveness of fault localization tools. In 2013 IEEE International Conference on Software Maintenance. IEEE, September 2013. URL: http://www.mysmu.edu/faculty/davidlo/papers/icsm13-localization.pdf, doi:10.1109/icsm.2013.42.
Lucia, Ferdian Thung, David Lo, and Lingxiao Jiang. Are faults localizable? In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR '12, 74–77. IEEE Press, 2012. URL: http://www.mysmu.edu/faculty/davidlo/papers/msr12-fault.pdf.
Chris Parnin and Alessandro Orso. Are automated debugging techniques actually helping programmers? In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA '11, 199–209. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://www.cc.gatech.edu/people/home/orso/papers/parnin.orso.ISSTA11.pdf, doi:10.1145/2001420.2001445.
Andreas Zeller. Isolating cause-effect chains from computer programs. In Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT '02/FSE-10, 1–10. New York, NY, USA, 2002. Association for Computing Machinery. URL: http://www.cs.umd.edu/~atif/zeller.pdf, doi:10.1145/587051.587053.
Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, and Ding Yuan. The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, 131–146. New York, NY, USA, 2019. Association for Computing Machinery. URL: https://doi.org/10.1145/3341301.3359650, doi:10.1145/3341301.3359650.