Augmenting generational testing with feedback-guided mutation

This is the paper with all the intensive details of the HypoFuzz evaluation, pitched as offering generalisable lessons. I’ve implemented many techniques from the fuzzing literature, so this is where I do the large evaluation of how each contributes to fuzzing performance in this unusual setting, including which are complementary, conflicting, substitutable, and so on. The closest comparable work I know of is AFL++. I conclude that there should be closer cooperation between the fuzzing and PBT/random-testing communities.

Abstract

Since the AFL fuzzer was released in 2014, feedback-guided mutational fuzzing has been a fertile and productive field for testing research. However, constructing harnesses to fuzz high-level and structured applications like compilers or network protocols with a bytestring-oriented fuzzer remains challenging. An alternative technique, and the earliest kind of fuzzing, is random generation of test inputs. This can be as simple as Miller’s original fuzz utility (random ASCII characters), or as complex as generating conforming C programs with Csmith.

Property-based testing tools like Hypothesis offer a unifying insight: making a sequence of random choices is equivalent to parsing a stream of bits, which is usually supplied by a PRNG but can alternatively be supplied by a fuzzer such as HypoFuzz. But how much does feedback-guided mutation add to sophisticated random generation?
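
The equivalence between random choices and parsing can be illustrated with a minimal sketch. This is not Hypothesis's or HypoFuzz's actual implementation: the names ChoiceStream, draw_byte, draw_int and gen_int_list are hypothetical, invented here for illustration, and the byte-per-choice encoding is an assumed simplification. The point is only that the same generator runs unchanged whether its bytes come from a PRNG or from a mutated buffer.

```python
import random


class ChoiceStream:
    """Hypothetical helper: exposes a fixed byte buffer as a sequence of choices."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def draw_byte(self) -> int:
        # An exhausted buffer yields zeros, so every buffer parses to some value.
        if self.pos >= len(self.data):
            return 0
        value = self.data[self.pos]
        self.pos += 1
        return value

    def draw_int(self, lo: int, hi: int) -> int:
        # Map one byte onto [lo, hi]. Slightly biased, which is fine for a sketch.
        return lo + self.draw_byte() % (hi - lo + 1)


def gen_int_list(stream: ChoiceStream) -> list[int]:
    """A generator is just a parser over the choice stream."""
    length = stream.draw_int(0, 4)
    return [stream.draw_int(-10, 10) for _ in range(length)]


# Source 1: a PRNG supplies the bytes (classic random generation).
rng = random.Random(0)
random_buffer = bytes(rng.randrange(256) for _ in range(16))
from_prng = gen_int_list(ChoiceStream(random_buffer))

# Source 2: a fuzzer supplies the bytes, here by mutating a saved buffer.
mutated = bytearray(random_buffer)
mutated[1] ^= 0xFF  # flip one byte, as a mutational fuzzer might
from_fuzzer = gen_int_list(ChoiceStream(bytes(mutated)))
```

Because gen_int_list never sees where its bytes came from, any bytestring-oriented mutation strategy can drive it, and every mutated buffer still parses to a well-formed list.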

I collect a corpus of NNN open-source projects tested with Hypothesis, totalling NNN property-based tests, and provide scripts to load each or all of these highly-structured targets and associated input generators for experiments on structured and hybrid fuzzing.

I use NNN fuzzer configurations to evaluate published techniques for scheduling, seed selection, mutation operators, coverage metrics, ensembling, additional feedback signals, and targeted fuzzing. I find XXXXXXXXXXX. As released, HypoFuzz uses augmented generation to discover new branches with X-YY times fewer inputs and discovers NNN previously unknown bugs solely by running existing upstream tests.