‘Ghostwriting’ property-based tests¶
Property-based testing is both more effective and often cheaper than traditional approaches to unit-testing, but has seen only limited adoption: even the popular Hypothesis library is used by only around 4% of Python programmers. Prospective users report that the primary barriers to adoption are uncertainty about how to identify appropriate test oracles, and whether learning and adopting property-based testing would be a good investment. In this paper I attempt to address their concerns via automatic test generation.
The Hypothesis ‘Ghostwriter’ produces idiomatic property-based tests, heuristically identifying common properties 1 in user code based on function names and infering input domains from type annotations and natural-language documentation. Unlike previous tools, the Ghostwriter does not use code coverage, execute the code to be tested, or mine usage examples - enabling “test-generation driven development” starting from function signatures. OSS-Fuzz recommends using Hypothesis and ghostwritten tests for complex inputs [Con21].
Ghostwritten tests can often be run as-is, but degrade to clearly commented stubs
for user completion when insufficient information is available. Of
ghostwriten tests for
NN open source libraries,
NNN were complete,
NNN required type annotations or minor edits, and only
i.e. oracles such as
x == load(save(x))or
f(a, b) == f(b, a); the term property-based testing derives from the use of such algebraic properties as test oracles.
Property-based testing originated with QuickCheck [CH00], a popular tool for Haskell which has been widely imitated and used in functional programming. Although property-based testing is both more effective and often cheaper than traditional manual approaches to unit-testing, it remains rare in mainstream languages. In the annual Python Software Foundation survey, 78% of respondents at least sometimes write tests for their code [PythonSoftwareFoundation20], but only 4% use Hypothesis [MHDC19,MD20].
A common theme when discussing Hypothesis at industry conferences or local user groups is that, having independently discovered it, many users are reluctant to invest time in learning or adopting property-based testing because they are uncertain how to identify testable properties of their code - i.e. partial specifications which are suitable as test oracles.
In this paper I attempt to address their concerns via automatic test generation using what [BHM+15] calls “derived oracles” based on function names or the “implicit oracle” that top-level functions should not raise exceptions. The generated tests are designed to “earn the trust of testers” [FSM+13], prioritising readability and correctness over completeness or even executability. To support experimental use, the generation process takes less than a second and is safe even if the code to be tested does IO.
Previous test-generation tools¶
Previous work on test generating has taken one of three basic approaches.
Note that all of these involve executing the code under test and measuring coverage. I argue that
test generation should avoid assuming that the present behaviour is the correct behaviour, as it limits use for unfinished software and ignores the possible presence of bugs, and
that coverage makes a poor metric - while all effective test suites will have high coverage, there is a great deal of variation at each level of coverage and attempting to reduce a test suite subject to coverage may weaken it substantially.
Random OOP test-suite generation¶
Randoop [PLEB07] and EvoSuite [FA11] , and the Python implementation
Random generation of tests, by constructing and aggregating fragments based
on type. No coincidence that it’s Java; but the generated tests are pretty weak.
Randoop works “forwards” (think stateful testing); EvoSuite works “backwards”
from_type), but IMO these are basically the same technique.
[FSM+13], from the authors of Evosuite, finds that using Evosuite
leads to clear improvements in commonly applied quality metrics such as code coverage [but] no measurable improvement in the number of bugs actually found by developers. … automatically generating a set of test cases, even high coverage test cases, does not necessarily improve our ability to test software.
Testing speculative program invariants¶
Daikon [EPG+07] detects program invariants, by tracking whether each of a universe of possible invariants is violated at runtime. QuickSpec [SJCA17] and Speculate [BR17] use a related technique, speculating on equational laws and using QuickCheck [CH00] to test for violating examples.
Slice programs into fuzz drivers¶
FUDGE [BBC+19] and FuzzGen [IAMP20] mine existing code to find chains of method calls which exercise the API of a library. They use coverage information to prioritise candidates for human inspection. WINNIE [JTH+21] does a similar trick, but using execution traces from a binary instead of source code analysis to identify targets and calling conventions.
The Hypothesis Ghostwriter¶
The goal of the Hypothesis Ghostwriter is to help users get started: to teach and to drive adoption of property-based testing. For novices, this means showing how to use Hypothesis’ library functions to write tests, and suggest properties which might be useful for their domain or application.
For experts, providing credible code is simply a time-saver compared to writing everything by hand.
rather than to improve productivity for experienced users. Treating the tool as a substitute for documentation drives several other design decisions:
Zero-config setup reduces the odds that prospective users will abandon property-based testing before they see it in action - or get stuck if they don’t read the documentation.
Idiomatic output means users can start immediately, by imitating ghostwritten code, and invest time studying the API and design principles only after they see value in it.
Incomplete is better than incorrect - it provides clear and concrete next steps for the user, which can be directed by
# TODO:code comments.
At the implementation level the Ghostwriter consists of Hypothesis’ preexisting techniques to infer input-generation “strategies” from types, a set of known properties and the heuristics to recognise each, and a morass of glue code to translate between objects and source code strings in each direction.
Inferring input domains
Hypothesis’ from_type logic, recursive building (ala evosuite), registration of custom types, entry-point plugins.
Predefined properties. Heuristics used to select properties. tests are named for the SUT and property; simpler than [DRF17] but same motivation.
Writing the test code¶
How we import objects for introspection. Skipping only as little as possible for invalid parts. Designed as an input to human tuning, not complete tests.
Divide this section into test quality (qualitative) and test performance (quantitative).
I evaluate the ghostwriter in NN ways:
1. On qualitative similarity to human-written tests for open-source projects. 3. Based on user surveys from conference workshops.
Marcel suggests: can we look at how many OSS-Fuzz entries have used Hypothesis and the ghostwriter? (too soon/few yet; but hopefully by time of publication)
Using a likert-like scale to subjectively evaluate test quality
Related question: “if this test failed, would you know why?” “would test failure be more likely to reveal a problem with the test, or the SUT?” -> important note, useful for education, not just test generation
Following the qualitative assessment, I attempt to quantify the performance of ghostwritten tests.
Ghostwritten tests achieve `XX-YY%` branch coverage on `NN` Python libraries and found `NN` previously unknown bugs. Of the `NNN` tests, `NNN` were complete, `NNN` required minor edits to compensate for missing type annotations, and only `NN` required nontrivial patches.
I evaluate the Ghostwriter by applying it to
NN Python libraries, and
reporting the resulting branch coverage for each, the number of bugs discovered,
and the work required to complete ghostwritten tests (none, minor, or substantial).
as-is / static type-inference with Pyre / dynamic type-inference with Monkeytype
default settings / longer running / coverage-guided fuzzer
however many configs make sense
Simple libraries: IDNA, colorsys, bidict, etc.
BugsInPy [WSL+20] dataset
Quixbugs [LKCSL17] dataset
Pynguin’s [LKF20] evaluation dataset
Unimplemented stubs for e.g. sorting,
others? e.g. language model evaluation sets in future-directions, below
Results (plus distribution over 30+ trials each):
number of test failures
number of bugs detected (failures can be duplicate or non-bugs)
report size of patch required to make each test useful (none, adjust strategies, adjust body, major revisions, discard entirely)
A different tack on evaluation: looking at the quality of generated tests, not just their coverage - supporting the idea that “guessed oracles” are helpful. [SZ11] and [SZ13] very useful here, especially if I can borrow the backwards-dynamic-slicing code from https://www.debuggingbook.org/html/Slicer.html
The Hypothesis Ghostwriter demonstrates that fast heuristic test generation is a viable approach, and has been both adopted and imitated in the wild.
Future research could infer program semantics and input types from method names [HostOstvold09] or natural-langugage comments or documentation [PVSS+07], define test templates for additional properties such as metamorphic relations, or use ghostwritten tests as a starting point for coverage- or mutation-driven test suite modification.
Integrating code generation from large language models (ala GPT-3 or GitHub Copilot): https://arxiv.org/pdf/2107.03374.pdf https://github.com/openai/human-eval https://arxiv.org/pdf/2105.09938.pdf https://github.com/hendrycks/apps https://arxiv.org/pdf/2009.05617.pdf
Domagoj Babic, Stefan Bucur, Yaohui Chen, Franjo Ivancic, Tim King, Markus Kusano, Caroline Lemieux, László Szekeres, and Wei Wang. Fudge: fuzz driver generation at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019. URL: https://research.google/pubs/pub48314/.
Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The oracle problem in software testing: a survey. IEEE Transactions on Software Engineering, 41(5):507–525, May 2015. URL: https://doi.org/10.1109/tse.2014.2372785, doi:10.1109/tse.2014.2372785.
Rudy Braquehais and Colin Runciman. Speculate: discovering conditional equations and inequalities about black-box functions by reasoning from test results. In Proceedings of the 10th ACM SIGPLAN International Symposium on Haskell. ACM, 2017. URL: https://matela.com.br/paper/speculate.pdf, doi:10.1145/3122955.3122961.
Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, 2000. URL: https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf, doi:10.1145/357766.351266.
OSS Fuzz Contributors. Integrating a python project. 2021. URL: https://google.github.io/oss-fuzz/getting-started/new-project-guide/python-lang/.
Ermira Daka, José Miguel Rojas, and Gordon Fraser. Generating unit tests with descriptive names or: would you name your children thing1 and thing2? In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, July 2017. URL: https://doi.org/10.1145/3092703.3092727, doi:10.1145/3092703.3092727.
Michael D Ernst, Jeff H Perkins, Philip J Guo, Stephen McCamant, Carlos Pacheco, Matthew S Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Science of computer programming, 69(1-3):35–45, 2007. URL: https://plse.cs.washington.edu/daikon/pubs/.
Gordon Fraser and Andrea Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE '11, 416–419. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://dl.acm.org/doi/pdf/10.1145/2025113.2025179, doi:10.1145/2025113.2025179.
Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. Does automated white-box test generation really help software testers? In Proceedings of the 2013 International Symposium on Software Testing and Analysis. ACM, July 2013. URL: https://doi.org/10.1145/2483760.2483774, doi:10.1145/2483760.2483774.
Einar W. Høst and Bjarte M. Østvold. Debugging method names. In Proceedings of the 23rd European Conference on ECOOP 2009 — Object-Oriented Programming, Genoa, 294–317. Berlin, Heidelberg, 2009. Springer-Verlag. URL: https://www.nr.no/directdownload/5006/H_st_-_Debugging_Method_Names.pdf, doi:10.1007/978-3-642-03013-0_14.
Kyriakos Ispoglou, Daniel Austin, Vishwath Mohan, and Mathias Payer. Fuzzgen: automatic fuzzer generation. In 29th USENIX Security Symposium (USENIX Security 20), 2271–2287. USENIX Association, August 2020. URL: https://www.usenix.org/conference/usenixsecurity20/presentation/ispoglou.
Jinho Jung, Stephen Tong, Hong Hu, Jungwon Lim, Yonghwi Jin, and Taesoo Kim. Winnie: fuzzing windows applications with harness synthesis and fast cloning. In Network and Distributed Systems Security (NDSS) Symposium. 2021. URL: https://www.ndss-symposium.org/wp-content/uploads/2021-334-paper.pdf.
Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. Quixbugs: a multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH Companion 2017, 55–56. New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3135932.3135941, doi:10.1145/3135932.3135941.
Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. Automated unit test generation for python. Lecture Notes in Computer Science, pages 9–24, 2020. URL: https://arxiv.org/abs/2007.14049, doi:10.1007/978-3-030-59762-7_2.
David MacIver, Zac Hatfield-Dodds, and Many Contributors. Hypothesis: a new approach to property-based testing. Journal of Open Source Software, 4(43):1891, 2019. URL: https://doi.org/10.21105/joss.01891, doi:10.21105/joss.01891.
David R. MacIver and Alastair F. Donaldson. Test-Case Reduction via Test-Case Generation: Insights from the Hypothesis Reducer (Tool Insights Paper). In Robert Hirschfeld and Tobias Pape, editors, 34th European Conference on Object-Oriented Programming (ECOOP 2020), volume 166 of Leibniz International Proceedings in Informatics (LIPIcs), 13:1–13:27. Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/13170, doi:10.4230/LIPIcs.ECOOP.2020.13.
Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In ICSE 2007, Proceedings of the 29th International Conference on Software Engineering, 75–84. Minneapolis, MN, USA, 2007. URL: https://homes.cs.washington.edu/~mernst/pubs/feedback-testgen-icse2007.pdf.
Lori Pollock, K. Vijay-Shanker, David Shepherd, Emily Hill, Zachary P. Fry, and Kishen Maloor. Introducing natural language program analysis. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE '07, 15–16. New York, NY, USA, 2007. Association for Computing Machinery. URL: https://doi.org/10.1145/1251535.1251538, doi:10.1145/1251535.1251538.
David Schuler and Andreas Zeller. Assessing oracle quality with checked coverage. In 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation. IEEE, March 2011. URL: https://doi.org/10.1109/icst.2011.32, doi:10.1109/icst.2011.32.
David Schuler and Andreas Zeller. Checked coverage: an indicator for oracle quality. Software Testing, Verification and Reliability, 23(7):531–551, May 2013. URL: https://doi.org/10.1002/stvr.1497, doi:10.1002/stvr.1497.
NICHOLAS SMALLBONE, MOA JOHANSSON, KOEN CLAESSEN, and MAXIMILIAN ALGEHED. Quick specifications for the busy programmer. Journal of Functional Programming, 2017. URL: http://www.cse.chalmers.se/~nicsma/papers/quickspec2.pdf, doi:10.1017/s0956796817000090.
Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, and Eng Lieh Ouh. Bugsinpy: a database of existing bugs in python programs to enable controlled testing and debugging studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, 1556–1560. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://github.com/soarsmu/BugsInPy, doi:10.1145/3368089.3417943.
Python Software Foundation. Python developers survey 2020 results. 2020. URL: https://www.jetbrains.com/lp/python-developers-survey-2020/.