Generating inhumane Python source code with Hypothesmith

A paper on https://github.com/Zac-HD/hypothesmith - generating random Python programs. I plan to upgrade it some more, primarily to be smarter about use of concrete syntax trees, and then go bug-hunting. I may also implement generation of safe-to-execute programs.

Note: planned evaluation is mostly bug-hunting, but will include assessing PBT tricks like targeted testing, swarm testing, and coverage feedback.

Abstract

Random generation of valid programs is an important technique for testing the many tools which operate on source code - including compilers, static analysers, and code transformation tools. While a mature field for C and related languages, program generation for Python is particularly challenging due to context-sensitive grammar and absence of any language specification.

I present Hypothesmith, a library of generators for valid Python source code in the Hypothesis property-based testing framework. Hypothesmith can generate source code based on an approximate grammar or by unparsing a concrete syntax tree, and guarantees that generated is always accepted by Python’s builtin compile() function.

Hypothesmith has already been adopted by CPython (caveats), Instagram’s LibCST project, the Black autoformatter, and a range of other linters and formatting tools. These tests have found NNN bugs, including a segfault bug which delayed the planned release of Python 3.9’s new PEG parser.

Reflections on the development of Hypothesmith

Recently, Hypothesmith has been considered as part of the testing strategy for a new parser in CPython - while still at a rather experimental stage.

This is possible largely because of of the architecture and ecosystem of extensions that have built up around Hypothesis (the underlying library for property-based testing), which I have found compoundingly valuable.

I hope that this experience report encourages other researchers to follow similar design principles.

Background on Hypothesis

  • General stats - very widely used in open source and industry

  • Existed in modern form since 2016; I’ve been a core dev since 2017.

  • David’s “how Hypothesis works” post. Key points
    • Instead of type-based or (lazy) tree-based generation and shrinking, Hypothesis is basically parser combinators composed with a byte-oriented generative/mutational fuzzer.

    • Literally everything can thus be saved (as a bytestring) and replayed

    • Tracking the parse tree makes mutation and shrinking way way more efficient. Heirarchal delta debugging is basically free, and the least of it - David has done a lot of really cool work on shrinking (see repo docs!)

    • Shrinking is compositional so users never need to know it exists at all.

    • This is an amazing IR layer for so much else. More on that later.

    • Solves the ‘fuzzer taming’ problem - we track a canonical minimal example (shortlex on buffer -> total ordering on test cases) for every exception type from a distinct line number and file. This also means ‘reducers are fuzzers’ is strictly positive for us.

  • Some neat extensions:
    • from_type() - we can infer strategies from types, and/or users can supply explicit generators.

    • extra.lark.from_grammar() - generate strings matching an EBNF grammar

    • from_regex() - ditto from Python’s PCRE (not a regular grammar!)

    • target() - any Hypothesis test can add targeted property-based testing. Where PropEr uses simulated annealing of a single metric, Hypothesis does a hillclimbing search from each vertex of the (approximate) pareto frontier of as many metrics as you like. (Cite Loscher and Sagonas, 2017 and 2018)

    • Hypothesis’ stateful testing - state-machine-based, like Scala’s (? check) - automatically embeds ‘swarm testing’ (cf A. Groce). (we’ve been meaning to add this to from_grammar too, which I should do soon)

  • Lots of third-party extensions as well as use.

  • Cite David and my JOSS paper, ref to (in press? ask him) Batten’s PyMTL paper.

Inspiration for Hypothesmith

  • Obvious from the name John Regehr’s CSmith was my main inspiration. Why should C programmers get to have all the fun?

  • While CSmith also requires CReduce (like AFL / AFL-cmin + tmin), Hypothesis gets shrinking for free. This generalises: David experimented with backing CSmith with Hypothesis internals and it works pretty well, though there’s some minimal boilerplate that C-Reduce could remove.

  • Prompted by discussion at PyCon US 2019 - new autoformatter Black was out, and the author mentioned that JS devs of Prettier had raved about fuzzing. So why not help out?

  • This implicitly made the case for the usefulness of a a syntax-only generator, which is much easier to implement than maintaining runtime properties by construction - but still useful for linters, formatters, parsers, etc. C.F. Regehr’s ‘levels of fuzzing’ esssay.

Development of Hypothesmith

A note on Python’s grammar

For simplicity, I’ll confine this to discussion of Python 3.8 and 3.9; for our purposes earlier versions still in use are much the same.

  • Python ships with a grammar file, using a custom EBNF dialect (doesn’t everyone!)

  • The Python 3.8 parser is LL(1). This is lovely, but has some problems with productions that are idiomatically left-recursive. There is thus an extensive post-processing step that rejects many strings accepted by the grammar - see https://www.python.org/dev/peps/pep-0617/ for a fuller discussion of reasons to move to the new PEG parser.

  • (forecast at time of writing) In Python 3.9, this is replaced by a PEG parser. Because the parser has an ordered choice operator an unlimited backtracking there is no ambiguity, so the grammar can be fully precise.

Early versions

Version 0.0.1 consisted of taking a translation of Python’s grammar into the ENBF format that Hypothesis already understood, using our existing from_grammar() function, plus some quick patches to make indentation work and replace the too-general regex pattern for identifiers.

This mostly didn’t work, because it had no access to Python’s post-processing logic. While accepted by the grammar, generated strings could not be parsed.

Version 0.0.2 added a global filter, using “accepted by Python’s compile builtin” as an oracle to detect valid source code. This was sound, but highly inefficient - almost all outputs contained only whitespace and a comment. On the other hand, this was enough to discover a novel bug in Python’s standard-library lib2to3 refactoring tool!

Finally, in versions 0.0.3 and 0.0.4 I pushed these filters down into the grammar - any production that could be checked for validity was checked as it was generated. Localising rejection sampling like this made generation far more efficient, and enabled generation of larger outputs for the first time.

At this point I had spent two or three days working on Hypothesmith. The implementation - not counting the grammar file, or Hypothesis - was exactly 100 lines of code including whitespace and comments. I’d discovered and reported four novel bugs upstream to CPython and Black, and went back to other projects.

Small but significant upgrades

These improvements consist mostly of taking a quick hack that I added to get the early versions working, and allowing for the generation of a much wider range of inputs. To put it another way, sacrificing development time instead of completeness while maintaining soundness.

  • Calculating an exact regex for valid Python identifiers, instead of the ASCII-only version I had used.

  • (TODO) Generating the number of spaces for each “indent token” independently, instead of always using four spaces. e.g. one function might use two spaces for the first indent and three for the second, while the next might use four and then one.

  • A quick-and-dirty hack inspired by swarm testing: only generate comments and ignored whitespace for ~1/4 outputs

Composing with other Hypothesis features

When I proposed using Hypothesmith to test the new parser at the 2020 Python Language Summit (blog posts forthcoming), as part of a general propsal to adopt property-based testing, the reception was generally positive.

Parser-related issue: https://github.com/we-like-parsers/cpython/issues/91 The new and old parser had already been equivalence-tested against the stdlib, and the ~4k most popular Python packages, flushing out many span-tracking bugs in the old parser.

I’m confident that Hypothesmith generates source code quite unlike that found in any well-maintained open source Python package, so adding it to the test suite seems likely to turn up novel bugs. The initial demo didn’t impress though, with a complete failure to generate any with-statements in thousands of example programs (changed handling of which is one of the few user-visible improvements from the new parser).

I therefore experimented a little with targeted property-based testing, and discovered that while targeting the length of generated source code was not helpful (how much trailing whitespace would you like?) more meaningful metrics made a substantial difference.

  • Number of instructions when the program is compiled to CPython bytecode

  • Number of nodes in the abstract syntax tree

  • Number of unique node types in the abstract syntax tree

TODO: obviously I need to report some metrics here. Roughly: average of 2, 4, 4 -> 55, 73, 19. It’s a big improvement!

I don’t think the result is world-leading, but I have found the development process remarkably smooth and encourage researchers to give more thought to the user experience and composability of the tools and techniques they pioneer.

Alternatives

More swarm testing

Automatic swarm testing in from_grammar() - has been on Hypothesis’ todo list for a while now at low priority. Seems like an obvious win, though for performace reasons we’ll probably want to track the dependency structure of various productions to avoid (in the extreme case) disabling our start production and making no progress. After this though, naive grammar-based generation is probably about mined out.

Fuzzing techniques

Mutational fuzzing - especially recombination of existing program fragments seems very promising but I don’t good Python tools for that easy to hand. The Fuzzing Book has some nice stuff on this I should reread for informed comparison.

CST-based generation

Generating and unparsing a concrete syntax tree - very promising, see https://github.com/Zac-HD/hypothesmith/issues/2 - especially if paired with the typed-ast and typeshed projects it might be reasonably easy to generate well-typed programs which are safe to execute.

Neat synergy: from_type() on type-annotated CST nodes mean we get most of the tree for free, and can focus on constructing the invariants we care about. This is exactly what I hoped automatic-from-type would let us do

Ultimately, I expect that CST-driven generation will be the only possible path to dynamically-conforming Python programs; but that in turn will bring differential compiler testing to an incredibly popular interpreted langauge.

What about (syntactically) invalid source code

I have deliberately avoided the question of generating invalid program source code. That’s because it’s almost too easy - just generate arbitrary text(), or replace (or move, or delete, etc.) a substring of an otherwise valid program, etc.

There’s certainly some interesting work to be done on how to realistically break valid programs though.

Generating as valid + perturbation also means that shrinking will continue to work really nicely - otherwise you’d probably get an “inverse threshold problem” where something that only just fails minimises to something wildly invalid.

Evaluation

In the shake-out-a-pile-of-bugs mode, reproduce https://www.drmaciver.com/2015/03/27-bugs-in-24-hours/ with hypothesmith. https://github.com/afd/hypothesis-ecoop-2020-artifact/blob/master/external-reducer-evaluations/src/reducereval/experiments/formatting.py

formatters: black, autoflake, yapf, ast.unparse, pyupgrade, pyformat?, others? linters: flake8 and various plugins, mypy, pylint, vulture, mccabe,

Find cases where
  • linters throw internal errors

  • formatters introduce new linter errors.

  • formatters take more than one pass to reach a fixpoint.

  • composing formatters breaks the fixpoint property (technically allowed, but probably bad)