‘Ghostwriting’ property-based tests

A paper on the Hypothesis ghostwriter (https://hypothesis.readthedocs.io/en/latest/ghostwriter.html). It is clearly related to prior art in test generation (Randoop, EvoSuite), fuzz-target generation (FUDGE, FuzzGen), and property discovery (Daikon, QuickSpec, Speculate), but the Hypothesis Ghostwriter takes a substantially different approach. This is another paper where the implementation is done, leaving only evaluation and writing. The evaluation demonstrates tests-from-stubs, but focusses on utility and data-structure libraries such as colorsys, IDNA, and bidict.


Property-based testing is often both more effective and cheaper than traditional approaches to unit testing, but has seen only limited adoption: even the popular Hypothesis library is used by only around 4% of Python programmers.¹ In my experience presenting at many Python conferences, the main barriers to adoption are (1) uncertainty about how to identify properties to test, and (2) the cost of learning the API.

I therefore introduce the Hypothesis ‘Ghostwriter’, which outputs the source code for property-based tests of the specified functions or modules. Depending on the nature and type annotations of the system under test, ghostwritten tests may be run as-is, or serve as stubs for the programmer to fill out.
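To illustrate the shape of the properties such tests encode, here is a stdlib-only sketch of a roundtrip test for `json.dumps`/`json.loads`. This is hand-written for illustration, not literal Ghostwriter output: a real ghostwritten test would use Hypothesis strategies to generate far richer inputs and to shrink any failing example.

```python
import json
import random
import string

def random_document(rng, depth=2):
    """Generate a small JSON-serialisable value (hypothetical stand-in
    for a Hypothesis strategy such as st.recursive)."""
    if depth == 0 or rng.random() < 0.4:
        return rng.choice([
            None,
            True,
            rng.randint(-10, 10),
            "".join(rng.choice(string.ascii_letters)
                    for _ in range(rng.randint(0, 5))),
        ])
    return {f"k{i}": random_document(rng, depth - 1)
            for i in range(rng.randint(0, 3))}

def test_dumps_loads_roundtrip(trials=100):
    # The roundtrip property: decoding an encoded value is the identity.
    rng = random.Random(0)
    for _ in range(trials):
        value = random_document(rng)
        assert json.loads(json.dumps(value)) == value
```

In Hypothesis itself the generator function would be replaced by a strategy and the loop by the `@given` decorator; the property being asserted is identical.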

Unlike previous work on automatic test generation, the Ghostwriter does not use coverage analysis, execute the system under test, or mine example usages. This ensures that tests reflect the ideal rather than the actual behaviour, and allows use on unfinished or prototype code (‘test-generation driven development’). Ghostwritten tests achieve XX-YY% branch coverage on NN Python libraries and found NN previously unknown bugs. Of the NNN tests, NNN were complete, NNN required minor edits to compensate for missing type annotations, and only NN required nontrivial patches.


¹ Versus about 75% engaged in any form of testing, and 49% using the pytest test runner (PSF survey).


[SPKT18] has a nice intro and lots of references to follow up.

Previous test-generation tools

Previous work on test generation has taken one of three basic approaches.

Randoop [PLEB07] and EvoSuite [FA11], along with the Python implementation Pynguin [LKF20], generate tests randomly by constructing and aggregating code fragments based on type. It is no coincidence that this line of work targets Java; the generated tests are fairly weak. Randoop works “forwards” (think stateful testing), while EvoSuite works “backwards” (like from_type), but in my view these are essentially the same technique.
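The “forwards” style can be sketched in a few lines. This is a hypothetical illustration, not Randoop's algorithm: a pool of typed values is grown by repeatedly calling functions whose argument types are already populated, and fragments that raise are discarded.

```python
import random

def forwards_generate(functions, seeds, steps=50, rng=None):
    """Randoop-style forwards generation (illustrative sketch).

    functions: list of (callable, arg_types, return_type) tuples.
    seeds: dict mapping a type to initial values of that type.
    Returns the grown pool, mapping each type to constructed values.
    """
    rng = rng or random.Random(0)
    pool = {t: list(vs) for t, vs in seeds.items()}
    for _ in range(steps):
        fn, arg_types, ret_type = rng.choice(functions)
        if not all(pool.get(t) for t in arg_types):
            continue  # cannot construct the arguments yet
        args = [rng.choice(pool[t]) for t in arg_types]
        try:
            result = fn(*args)
        except Exception:
            continue  # discard failing fragments, as Randoop does
        pool.setdefault(ret_type, []).append(result)
    return pool
```

The “backwards” style inverts this: start from a target type and recursively construct whatever its constructor needs, which is exactly the shape of Hypothesis's from_type.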

FUDGE [BBC+19] and FuzzGen [IAMP20] mine existing code to find chains of method calls which exercise the API of a library. They use coverage information to prioritise candidates for human inspection.

Daikon [EPG+07] detects likely program invariants by tracking whether each of a universe of possible invariants is violated at runtime. QuickSpec [SJCA17] and Speculate [BR17] use a related technique, speculating on equational laws and using QuickCheck [CH00] to test for violating examples.
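The Daikon-style filtering step can be sketched as follows. This is a toy illustration (a three-entry candidate universe over a single variable, where Daikon's is far larger and relational): every observed execution eliminates the candidates it violates, and whatever survives is reported as a *likely* invariant, not a proven one.

```python
# Candidate invariant universe (illustrative, not Daikon's actual grammar).
CANDIDATES = {
    "x >= 0": lambda x: x >= 0,
    "x != 0": lambda x: x != 0,
    "x % 2 == 0": lambda x: x % 2 == 0,
}

def likely_invariants(observed_values, candidates=CANDIDATES):
    """Discard every candidate violated by some observed value."""
    surviving = dict(candidates)
    for x in observed_values:
        violated = [name for name, check in surviving.items() if not check(x)]
        for name in violated:
            del surviving[name]
    return set(surviving)
```

This makes the limitation noted below concrete: the output is only as good as the observed executions, so buggy present behaviour is faithfully reported as an “invariant”.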

Note that all of these involve executing the code under test and measuring coverage. I argue that

  1. test generation should avoid assuming that the present behaviour is the correct behaviour, as it limits use for unfinished software and ignores the possible presence of bugs, and

  2. coverage is a poor metric: while all effective test suites have high coverage, there is a great deal of variation among suites at each level of coverage, and reducing a test suite subject to a coverage target may weaken it substantially.

[HostOstvold09] leverages implicit naming conventions to check whether a method name is appropriate for the implementation, while [PVSS+07] exploits natural-language information in comments as well as method names. Investigating inference of partial specifications from natural-language comments and documentation, in addition to names, appears promising for dynamic and optionally-type-annotated languages such as Python or JavaScript.
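As a concrete example of the kind of name-based heuristic a ghostwriter can use, the sketch below pairs functions whose names suggest inverse operations and proposes a roundtrip property for each pair. The prefix table and function are illustrative inventions, not the Ghostwriter's actual rules.

```python
# Hypothetical table of name prefixes that suggest inverse operations.
INVERSE_PREFIXES = [
    ("encode", "decode"),
    ("dumps", "loads"),
    ("compress", "decompress"),
]

def propose_roundtrips(names):
    """Propose (forward, inverse) pairs from a set of function names."""
    proposals = []
    for fwd, back in INVERSE_PREFIXES:
        for name in sorted(names):
            if name.startswith(fwd):
                partner = back + name[len(fwd):]  # e.g. encode_x -> decode_x
                if partner in names:
                    proposals.append((name, partner))
    return proposals
```

Each proposed pair then maps directly onto a roundtrip property of the form `assert inverse(forward(x)) == x`.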

There is some recent work on name-based bug detection; see DeepBugs [PS18] by Michael Pradel. Prior work considered mismatches between the names of the parameter and the variable used as an argument ([RAJ+17], [PG11], [LLS+16]), which is less useful when designing a ghostwriter.

The Hypothesis Ghostwriter

About the tool, goals, overview of functionality


How it works: Hypothesis's from_type logic, recursive building (à la EvoSuite), registration of custom types, and entry-point plugins. Predefined properties, and the heuristics used to select among them. How we import objects for introspection, skipping as little as possible when parts are invalid. Designed as an input to human tuning, not as complete tests.
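The recursive, registry-extensible type resolution can be sketched as follows. This is a hedged stand-in for Hypothesis's from_type machinery, not its implementation: real strategies are lazy and shrinkable, and registration happens via hypothesis.strategies.register_type_strategy; here both are collapsed into eager value construction.

```python
import random
import typing

# User-extensible registry of type -> constructor, mirroring the idea of
# register_type_strategy (names and shapes here are illustrative only).
_REGISTRY = {}

def register(type_, make):
    _REGISTRY[type_] = make

def value_from_type(type_, rng):
    """Recursively build a value for a type annotation."""
    if type_ in _REGISTRY:
        return _REGISTRY[type_](rng)
    origin = typing.get_origin(type_)
    if origin is list:
        # Containers recurse on their type arguments.
        (elem,) = typing.get_args(type_)
        return [value_from_type(elem, rng)
                for _ in range(rng.randint(0, 3))]
    if type_ is int:
        return rng.randint(-100, 100)
    if type_ is str:
        return "".join(rng.choice("abc")
                       for _ in range(rng.randint(0, 5)))
    raise NotImplementedError(f"no strategy for {type_!r}")
```

The same dispatch-then-recurse structure is what lets annotated signatures produce runnable tests directly, while unannotated parameters fall through to a stub for the programmer to fill in.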



I evaluate the Ghostwriter by applying it to NN Python libraries, and reporting the resulting branch coverage for each, the number of bugs discovered, and the work required to complete ghostwritten tests (none, minor, or substantial).



  • Simple libraries: IDNA, colorsys, bidict, etc.

  • BugsInPy [WSL+20] dataset

  • Quixbugs [LKCSL17] dataset

  • Pynguin’s [LKF20] evaluation dataset

  • others?

Results (plus distribution over 30+ trials each):

  • branch coverage

  • number of test failures

  • number of bugs detected (failures can be duplicate or non-bugs)

  • report size of patch required to make each test useful (none, adjust strategies, adjust body, major revisions, discard entirely)

I also conduct a user study with open source maintainers and undergraduate students… TODO: draft survey instrument, seek ethics approval.



Domagoj Babic, Stefan Bucur, Yaohui Chen, Franjo Ivancic, Tim King, Markus Kusano, Caroline Lemieux, László Szekeres, and Wei Wang. Fudge: fuzz driver generation at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019. URL: https://research.google/pubs/pub48314/.


Rudy Braquehais and Colin Runciman. Speculate: discovering conditional equations and inequalities about black-box functions by reasoning from test results. In Proceedings of the 10th ACM SIGPLAN International Symposium on Haskell. ACM, 2017. URL: https://matela.com.br/paper/speculate.pdf, doi:10.1145/3122955.3122961.


Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, 2000. URL: https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf, doi:10.1145/357766.351266.


Michael D Ernst, Jeff H Perkins, Philip J Guo, Stephen McCamant, Carlos Pacheco, Matthew S Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Science of computer programming, 69(1-3):35–45, 2007. URL: https://plse.cs.washington.edu/daikon/pubs/.


Gordon Fraser and Andrea Arcuri. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE ‘11, 416–419. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://dl.acm.org/doi/pdf/10.1145/2025113.2025179, doi:10.1145/2025113.2025179.


Einar W. Høst and Bjarte M. Østvold. Debugging method names. In Proceedings of the 23rd European Conference on ECOOP 2009 — Object-Oriented Programming, Genoa, 294–317. Berlin, Heidelberg, 2009. Springer-Verlag. URL: https://www.nr.no/directdownload/5006/H_st_-_Debugging_Method_Names.pdf, doi:10.1007/978-3-642-03013-0_14.


Kyriakos Ispoglou, Daniel Austin, Vishwath Mohan, and Mathias Payer. Fuzzgen: automatic fuzzer generation. In 29th USENIX Security Symposium (USENIX Security 20), 2271–2287. USENIX Association, August 2020. URL: https://www.usenix.org/conference/usenixsecurity20/presentation/ispoglou.


Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. Quixbugs: a multi-lingual program repair benchmark set based on the quixey challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, SPLASH Companion 2017, 55–56. New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3135932.3135941, doi:10.1145/3135932.3135941.


Hui Liu, Qiurong Liu, Cristian-Alexandru Staicu, Michael Pradel, and Yue Luo. Nomen est omen: exploring and exploiting similarities between argument and parameter names. In Proceedings of the 38th International Conference on Software Engineering, ICSE ‘16, 1063–1073. New York, NY, USA, 2016. Association for Computing Machinery. URL: https://doi.org/10.1145/2884781.2884841, doi:10.1145/2884781.2884841.


Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. Automated unit test generation for python. Lecture Notes in Computer Science, pages 9–24, 2020. URL: https://arxiv.org/abs/2007.14049, doi:10.1007/978-3-030-59762-7_2.


Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In ICSE 2007, Proceedings of the 29th International Conference on Software Engineering, 75–84. Minneapolis, MN, USA, 2007. URL: https://homes.cs.washington.edu/~mernst/pubs/feedback-testgen-icse2007.pdf.


Lori Pollock, K. Vijay-Shanker, David Shepherd, Emily Hill, Zachary P. Fry, and Kishen Maloor. Introducing natural language program analysis. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE ‘07, 15–16. New York, NY, USA, 2007. Association for Computing Machinery. URL: https://doi.org/10.1145/1251535.1251538, doi:10.1145/1251535.1251538.


Michael Pradel and Thomas R. Gross. Detecting anomalies in the order of equally-typed method arguments. In Proceedings of the 2011 International Symposium on Software Testing and Analysis, ISSTA ‘11, 232–242. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://doi.org/10.1145/2001420.2001448, doi:10.1145/2001420.2001448.


Michael Pradel and Koushik Sen. Deepbugs: a learning approach to name-based bug detection. Proc. ACM Program. Lang., October 2018. URL: https://doi.org/10.1145/3276517, doi:10.1145/3276517.


Andrew Rice, Edward Aftandilian, Ciera Jaspan, Emily Johnston, Michael Pradel, and Yulissa Arroyo-Paredes. Detecting argument selection defects. Proc. ACM Program. Lang., October 2017. URL: https://doi.org/10.1145/3133928, doi:10.1145/3133928.


Marija Selakovic, Michael Pradel, Rezwana Karim, and Frank Tip. Test generation for higher-order functions in dynamic languages. Proc. ACM Program. Lang., October 2018. URL: https://doi.org/10.1145/3276531, doi:10.1145/3276531.


Nicholas Smallbone, Moa Johansson, Koen Claessen, and Maximilian Algehed. Quick specifications for the busy programmer. Journal of Functional Programming, 2017. URL: http://www.cse.chalmers.se/~nicsma/papers/quickspec2.pdf, doi:10.1017/s0956796817000090.


Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, and Eng Lieh Ouh. Bugsinpy: a database of existing bugs in python programs to enable controlled testing and debugging studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, 1556–1560. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://github.com/soarsmu/BugsInPy, doi:10.1145/3368089.3417943.