Zac Hatfield-Dodds, the Assurance Team Lead at Anthropic, speaks with host Gregory M. Kapfhammer about property-based testing techniques and how to use them in an open-source tool called Hypothesis. They discuss how to define properties for a Python function and implement a test case in Hypothesis. They also explore some of the advanced features in Hypothesis that can automatically generate a test case and perform fuzzing campaigns.
- Zac Hatfield-Dodds’ Web site: Zac HD
- Documentation for Hypothesis:
- Web site for Hypothesis:
- Documentation for the Hypothesis GhostWriter:
- GitHub repository for Hypothesis:
- Documentation for HypoFuzz:
- Web site for HypoFuzz:
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Greg Kapfhammer 00:00:18 Welcome to Software Engineering Radio. I’m your host, Gregory Kapfhammer. Today’s guest is Zac Hatfield-Dodds, an Australian researcher, software engineer, and open-source maintainer. Zac leads the Assurance team at Anthropic, an AI safety and research company building reliable, interpretable, and steerable AI systems. With the goal of making testing easier for software developers, Zac maintains open-source projects like Hypothesis, Pytest and HypoFuzz. In addition to being a speaker at conferences like PyCon, Zac was recognized as a fellow of the Python Software Foundation. Welcome to Software Engineering Radio, Zac.
Zac Hatfield-Dodds 00:01:02 It’s great to be with you Greg.
Greg Kapfhammer 00:01:04 Thanks for joining us today. We’re going to be talking about property-based testing and how developers can create property-based tests in Pytest. We’ll be learning how to do that in an open source tool called Hypothesis. Zac, before we dive into the technical details, can you give us a brief introduction to property-based testing and how it works with the Hypothesis tool?
Zac Hatfield-Dodds 00:01:27 Absolutely. So when I think about writing software tests in the kind of unit test style, maybe they’re testing small units, maybe they’re testing whole programs. There are basically two problems that you have to solve in order to have a software test. The first is that you need to work out what kind of input data you run your code on, and the second is that you need to check in some way that your code didn’t do the wrong thing. In Python, that usually means you run a test function and the test is considered to fail if it raises an exception, or in unittest if it calls a particular assert method with a false value. And the first part is coming up with the test inputs that actually elicit that behavior that might be buggy. And so property-based testing gives us a different attitude to what we check and a much richer set of tooling to help computers generate those inputs for us.
Greg Kapfhammer 00:02:18 So one of the things that I’m learning is that as a tester you have to think about both the inputs and the checks and it sounds like Hypothesis helps us to do that in an automated fashion. Can you share a success story associated with the use of property-based testing and the Hypothesis tool?
Zac Hatfield-Dodds 00:02:37 One of my favorite case studies is from a friend of mine who some years back was teaching a machine learning course. This was long enough ago that PyTorch was not yet installable on Windows. And so to help his students get started, he wanted to have an auto diff library that could be used for simple machine learning problems but was much easier to install. And he credits Hypothesis with making that possible. He says that writing an advanced and correct library that covered all of the edge cases would’ve been impossible for one person to do before the course started if he hadn’t had these powerful techniques for testing it.
Greg Kapfhammer 00:03:12 So it sounds like property-based testing and Hypothesis helps us to do things like narrow in on the edge cases in our program under test. I’d love to learn more about property-based testing later in the show, but in order to ensure that we set a firm foundation, can you give us a brief definition of a few software testing concepts? So to get started, for example, what is a test case and a test suite?
Zac Hatfield-Dodds 00:03:37 Great questions. So there are a couple of pieces here. When we say test suite, what we usually mean is all of the tests that we have. So if you have your CI system set up to run a bunch of tests on each pull request, for example, all of the stuff that gets run collectively is your test suite. A test function is then one particular function or method that’s part of your test suite. And a test case is one particular input. So you might have a parameterized test, sometimes people call them table-driven tests, where you write down many inputs and outputs. We would say that that is a single test function, a part of your test suite, and that the table contains many test cases. The thing you are testing is usually called the system under test, sometimes function under test; in hardware, people would say device under test.
Greg Kapfhammer 00:04:31 That makes a lot of sense. Can you tell us what’s the meaning of the term example based test case?
Zac Hatfield-Dodds 00:04:37 So an example based test is pretty much defined by contrast to property-based tests. A standard example based test is where you write down a particular input to your code. Then usually you call your code on that input and most often you then assert that the output you actually got is what you expected. And so because you are thinking up this single example for yourself, we often call that an example based test.
Greg Kapfhammer 00:05:03 Okay. So would you agree that many times when people use Python testing tools like unittest or Pytest, that example based testing is the standard approach to writing test cases?
Zac Hatfield-Dodds 00:05:17 Yeah, I do think it’s the most common approach to testing in this style.
Greg Kapfhammer 00:05:21 Okay. So if listeners wanna learn more about this common approach to testing, they’re encouraged to check out episode 516. It was called Brian Okken on testing in Python with Pytest and you’ll learn a lot more details. But what I’d like to do now is to dive into our discussion of property-based testing with Hypothesis. So you mentioned a moment ago and we discussed example based testing. Can you comment clearly on the downsides associated with this approach?
Zac Hatfield-Dodds 00:05:51 Absolutely. So the first thing I do wanna say before I get into the downsides is that I think almost every good test suite will involve some or maybe many example based tests. There really is a lot of value in pinning down the exact behavior in particular edge cases or even randomly selected examples. The problem with basing your entire test suite on example based tests is just that it takes a long time and it gets really tedious to write them. When I hear people talking about testing, people often and I think correctly say things like, “This just feels really tedious. I feel like I’m writing the same code over and over. I’m just not really sure that this test that I’m writing actually delivers much value to the organization or the project that I’m working on.” And so I think most of the downsides are a combination of requiring a lot of manual effort to define a rigorous test suite like this. And also that it’s really easy to miss the edge cases that you didn’t think of. And the edge cases that we didn’t think of when we’re implementing the code are likely to be exactly the edge cases that we don’t think of when we’re testing the code.
Greg Kapfhammer 00:06:53 I really resonate with what you said, Zac. I too have found testing in the example based fashion a little manual and tedious and I can definitely tell you that I’ve overlooked edge cases many times. So to make our discussion of property-based testing more concrete, let’s take an example of a program and then explore some of its properties. One thing that comes to my mind is that it’s frequently the case that programs have to load in some data and or save some data. Can you comment on how property-based testing might be applied in this general circumstance?
Zac Hatfield-Dodds 00:07:30 This is actually one of the best places to start with property-based testing, precisely because it’s a task that almost all programs have to do. You know, as the saying goes, if your program never does any input or output, you don’t have to run it at all. The other reason that this is particularly useful to test is because saving and loading data tend to cross many different layers of abstraction. When you call a save function in a high-level language, you’ll be dealing with many layers of a software stack, all the way down through native code, device drivers, the operating system, and even the hardware. And so it’s both a simple invariant to express, to say, if I save data and then I load what I saved, I should have the same data, but it’s surprisingly complex to implement. And that combination, plus the presence in almost every program, makes it a great place to start property-based testing.
Greg Kapfhammer 00:08:21 Okay. So to be clear, can you say again what is the property for this function or program under tests that we’re talking about now?
Zac Hatfield-Dodds 00:08:29 The property here is what we would often call an invariant. Something that should always be true no matter what test case or input example we decide to try the test on. And that invariant is for any input test case. If we save that and then we load the thing that we saved, we should get back an equal object or an equal value to the thing that we saved.
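[Editor’s note: the round-trip invariant Zac describes can be sketched in a few lines of Hypothesis. This is a minimal illustration that uses the standard library’s json module as a stand-in for whatever save/load pair you are testing; the strategy choice is an arbitrary assumption for the sketch.]

```python
import json

from hypothesis import given, strategies as st


# Sketch of the round-trip invariant: load(save(x)) == x.
# json.dumps / json.loads stand in for any save/load pair.
@given(st.dictionaries(st.text(), st.integers() | st.text()))
def test_save_load_roundtrip(record):
    saved = json.dumps(record)   # "save"
    loaded = json.loads(saved)   # "load"
    assert loaded == record      # the invariant


test_save_load_roundtrip()  # a @given-wrapped test can be called directly
```

Running the wrapped function (here directly, or via Pytest) exercises roughly a hundred generated records, including edge cases like the empty dictionary and empty strings.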
Greg Kapfhammer 00:08:52 Okay. So what’s the benefit of thinking about testing this program and using property-based testing with that invariant or property in mind?
Zac Hatfield-Dodds 00:09:02 So I think of this as having a couple of advantages. The first is that if you want to write tests that cover a variety of different inputs, it’s easier if you separate out the definition of those test cases from the code that actually exercises them. This is kind of a standard separation of concerns argument. I think that gets you as far as what we call table based testing or the Pytest parameterized decorator. But property-based testing goes a little further. Because we don’t have to think of all the test cases ourselves and we have computer assistance to generate those, the same test can find edge cases that we hadn’t considered. For example, maybe we forgot to think about what would happen if we tried to save an empty record or an empty string or list. Our property-based testing library is very likely to try that as a special case due to all the heuristics that maintainers like myself build in and that will discover the edge cases that you or I hadn’t realized we should check for.
Greg Kapfhammer 00:10:01 If I’m understanding you correctly, I think what you’re telling me is that Hypothesis which implements property-based testing will actually be creating these inputs for me. Is that the correct way to think about it?
Zac Hatfield-Dodds 00:10:13 That’s right. So what you do with Hypothesis is you write a test function which should work for any valid input, and then you describe to Hypothesis what kind of inputs are valid. So you might say any list of numbers is valid, or any instance of this type is valid, or any object which matches this database schema or this OpenAPI web-service schema is valid. And then Hypothesis can try to generate many valid inputs according to that specification you gave it, looking for one which makes your test fail.
Greg Kapfhammer 00:10:49 That sounds really useful and in a moment I wanna dive into the specific ways in which Hypothesis automatically does that. Before we do so, can you comment briefly what are the practical challenges or fundamental limitations that we as developers may face when we’re using Hypothesis to perform property-based testing?
Zac Hatfield-Dodds 00:11:10 Great question. So I think these challenges are not specific to Hypothesis as an implementation, but to the ideas of property-based testing in general. And when I talk to students I am teaching or other professionals at conferences, the two most common concerns I hear from people are first, that it can take a while to get used to the API for describing what kinds of inputs are valid. When I was saying things like lists of integers, this is a relatively simple case, but if you have highly structured data with complicated internal dependencies, it can take a while to get used to the API and learn how to generate that efficiently. And the second problem that I hear from many people is that they don’t quite know how to think about finding the invariants or those properties in their own code. If we have a business application for example, it’s not always clear what kind of invariants are there. And my advice is always to not worry about that too much and start by focusing on properties like the save, load, round-trip invariant that we talked about or given any valid input, my code should not crash. That’s kind of the simplest possible invariant, but it’s also in my code at least embarrassingly effective.
Greg Kapfhammer 00:12:23 It sounds like what you’re saying is we shouldn’t get caught up in picking the right invariant and instead we should charge ahead with an invariant that we understand and see if Hypothesis can find bugs based on that simple invariant. Is that right?
Zac Hatfield-Dodds 00:12:37 At least to get started. I think people often have this kind of paralysis where jumping in is the way to go. And I find once you’ve written the simple tests, the tests that apply to almost any project, it usually becomes clear what other properties you could test and you’re much more comfortable with the libraries and the techniques at that point.
Greg Kapfhammer 00:12:56 Okay. So with that challenge, let’s go ahead and dive into the specifics about how Hypothesis works. Now, bearing in mind that Hypothesis is a Python program, maybe we should start, how do you install Hypothesis and set it up in a virtual environment?
Zac Hatfield-Dodds 00:13:13 So, without trying to describe how to set up your preferred Python environment, Hypothesis is available from the Python Package Index. So you can simply do a pip install hypothesis, or if you prefer the Conda package manager, you can also conda install hypothesis from the community conda-forge channel.
Greg Kapfhammer 00:13:30 Okay. Now one of the things I’m curious about is Hypothesis a library that I use in my test suite or is it a command line interface tool or is it both of those?
Zac Hatfield-Dodds 00:13:41 We ship both of those. So the main use is as a library: within your code it just helps you write better test functions, which you can then execute with Pytest or with the standard library’s unittest test runner, or you can call Hypothesis test functions by hand if you would like to. And then we also provide a command line tool as part of the library, which offers things like test generation that we’ll talk about later, and also automated refactoring tools for when we occasionally have to deprecate old options.
Greg Kapfhammer 00:14:12 That sounds neat. So let’s start with the basics in order to get up and running with Hypothesis. I think at the beginning we’re going to have to import the Hypothesis module and then there are things like a Given and a Strategy. What is a Given and what is a Strategy?
Zac Hatfield-Dodds 00:14:30 Alright, please forgive me listeners for trying to verbally sketch the source code to you. For those who are less used to Python, Python has this neat syntax called decorators. They allow you to apply a function to a function, and so @given is a decorator. It takes what we call strategies. Strategies are the way that we describe what kind of input data you want, and the @given decorator wraps your test function with a helper that uses those strategies to generate examples and then internally calls your function many times, searching for a failing example.
Greg Kapfhammer 00:15:05 Can you tell us a little bit more about certain strategies that are a part of Hypothesis? For example, I know there’s an integer strategy or a text strategy. What do they do?
Zac Hatfield-Dodds 00:15:17 So the fundamental thing that a strategy does is it describes to the Hypothesis engine, which is a kind of internal API, how to generate examples of a particular kind. And in some cases that’s a particular data type. So for example, the integers strategy by default can generate any integer. It’s that simple. But the integers strategy also accepts optional minimum and maximum value arguments. So you can say, I want any integer which is greater than three, or I want integers but only between zero and a hundred. And so it’s not necessarily type-based. You don’t have to say my function has to handle any integer. You can say, actually, I only want to test this particular range for this test function. And so as well as supporting generating particular atomic elements, strategies include things like lists, where you pass a strategy for how to generate each element of the list and, optionally, bounds on the size of the list and whether you want it to be unique and so on. We have strategies which combine other strategies, such as a union or an OR operator. So you can say I would like integers or text, and we have things which kind of infer the data from a schema. So you can say, for example, I would like strings that match the following regular expression, and we’ll do our best to generate those for you.
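[Editor’s note: the strategies mentioned here can be written out roughly as below. The names are real Hypothesis APIs; the particular bounds and the regular expression are arbitrary choices for the sketch.]

```python
from hypothesis import strategies as st

ints = st.integers(min_value=0, max_value=100)               # bounded integers
nums = st.lists(ints, min_size=1, max_size=5, unique=True)   # list options
mixed = st.integers() | st.text()                            # union via the OR operator
codes = st.from_regex(r"[A-Z]{3}-[0-9]{4}", fullmatch=True)  # inferred from a schema

# .example() draws a single value, which is useful for exploring
# strategies interactively, though not inside real tests.
print(ints.example())
```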
Greg Kapfhammer 00:16:39 So how does a developer decide which strategy they’re supposed to use?
Zac Hatfield-Dodds 00:16:45 This is something that I think comes a little with experience, but it also comes with the definition of your test function. So when we’re thinking about, for example, if we wanted to test division, we would think about, well, what are all of the possible inputs for which division should not crash? We’d say, well, we can divide two numbers. So each of our two numbers might be integers, they might be floating point numbers. If you got fancy, you might say we also wanna allow the Fraction class from the standard library. And so we would say, okay, X is an integer or a float or a fraction, and Y is an integer or a float or a fraction. And Hypothesis will pretty quickly tell you about zero division errors.
Greg Kapfhammer 00:17:27 Okay, I see what you’re getting at. So I agree that it’s difficult to discuss these topics without looking at specific source code, but it sounds like one of the things that you’re explaining is that I’ll know what strategy to pick based on what I want to test and what are the inputs to the function under test? Is that the right way to think about it?
Zac Hatfield-Dodds 00:17:47 Yeah, I don’t wanna pretend that it’s always easy. This is one of the two things that I talked about people finding hard or taking a while to learn earlier. But I do think that once you’ve thought about what you are testing, there’s two parts to this. The first is working out what kind of inputs you want or what the domain of valid inputs looks like. And the second is thinking about how do I express that in Hypothesis? And we’ve worked really hard to try to make the second step as easy as possible though it’s not always as easy as we would like it to be.
Greg Kapfhammer 00:18:17 I remember you saying before that Hypothesis works with unittest and Pytest. Can you talk a little bit about how you would actually combine all of these things into a Pytest test case and then into a test suite?
Zac Hatfield-Dodds 00:18:32 Absolutely. So combining it into a test suite is really easy: from Hypothesis’ perspective, we don’t know anything about running tests. What we let you do is define more effective test functions or methods, and so Pytest is responsible for finding that and executing it as part of your test suite. So the way we would define this is we would start by writing a pretty standard unit test. For division, we might do def test_division. We’ll say it takes arguments X and Y, and the body of that function just does X divided by Y. We’re not gonna assert anything, we’re just checking whether doing this operation can ever crash. And then at the top, before the definition, we’re going to add @given, so we have that decorator, and we’ll pass it arguments X equals integers or floats and Y equals integers or floats. Those arguments to @given, which are strategies, will then be used by the Hypothesis engine to pick particular values. And then those values will be used to call the test function that we’ve wrapped with many different arguments until we find some that make it crash.
Greg Kapfhammer 00:19:38 If I was a developer and I was using Hypothesis and I now have written a test and it’s property-based, how do I actually run it on my laptop or how do I run it in a continuous integration environment? Can you explore those details?
Zac Hatfield-Dodds 00:19:53 Sure. The simple description is that you run it in exactly the same way that you would run any other test function defined in the same place. So the decorator simply returns a function that you can call exactly like you call a normal python function and that can be executed on your behalf by Pytest for example, by running the Pytest command and letting it be collected or if you’ve defined a unittest method by using your usual unittest application.
Greg Kapfhammer 00:20:19 So that sounds really exciting. I think what you’re telling me is that if I already know how to run Pytest or unittest, Hypothesis will directly integrate into my workflow. How cool is that? What I want to do now quickly is to talk about the things that may happen when I’m running my test suite and it could be in CI or it could be on my developer workstation. But let’s suppose that a test case passes. What do I know when a test case that is property-based passes when it is run?
Zac Hatfield-Dodds 00:20:50 So what you know is that the Hypothesis engine has tried to generate many examples. By default, we generate a hundred examples, though you can increase or decrease that number if you prefer, and none of them made your test fail. If you want to see exactly which examples we generated, there’s a verbosity setting. So you can turn that up and watch the exact inputs. You can also declare particular inputs that you want to test every time using an @example decorator.
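[Editor’s note: both knobs Zac mentions are decorators. A small sketch; the property being checked and the budget of 200 are arbitrary choices for illustration.]

```python
from hypothesis import example, given, settings, strategies as st

seen = []


@settings(max_examples=200)  # raise the default budget of 100
@given(st.text())
@example("")                 # always replay this exact input as well
def test_concat_length(s):
    seen.append(s)
    assert len(s + s) == 2 * len(s)


test_concat_length()
print(len(seen))  # the pinned example plus the generated ones
```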
Greg Kapfhammer 00:21:17 If I’m running my test suite and then I notice that a test case fails, what will Hypothesis tell me and how will I then use that information to debug my program?
Zac Hatfield-Dodds 00:21:28 That’s a great question. So those of our listeners who have done randomized testing before might be feeling nervous here. Randomized testing usually has two disadvantages that Hypothesis works to mitigate. The first is that a random test input might be very large or very complicated or hard to understand in various ways. And Hypothesis tries to deal with this by presenting minimal failing examples. Once we find an input that makes your test fail, we try variations on it to be as small and as simple as we can find and then we report that minimal failing example. The second problem that people are often concerned about is flakiness. You want to be confident that if you run your tests and they fail, then you try to fix the bug and you run your tests again and they pass. You really don’t wanna wonder, did I really fix the bug or did my tools just not find the bug this time? And so to mitigate that Hypothesis actually saves all of the inputs the underlying way that we generate them in a little local database. And so when you rerun your tests, we actually start by trying to replay any historical failures and it’s only if all of the previous failures now pass that we go on to randomly generate more.
Greg Kapfhammer 00:22:39 So is running my test suite on a developer workstation different than running it in CI when it comes to saving those sample inputs or does it work in the same way?
Zac Hatfield-Dodds 00:22:49 So if you are running in CI, people will often use fresh virtual machines which were set up to order for that CI run. In that case, it’s often worth configuring your CI pipeline to save that database and cache it between runs, and in cases like GitHub Actions, Hypothesis actually ships with tools specifically to make that easier, and even to share that with local developer workstations by transparently pulling down any failures found on CI to replay locally. If you’re in a workplace where you have things like servers or a database on the network, we also have pretty easy tools to wrap that into the database and store it in network-attached storage instead, so that everybody who runs the test shares those failures between them.
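[Editor’s note: pointing the example database at a directory your CI caches between runs is a short settings-profile sketch; the profile name and path here are arbitrary choices.]

```python
from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

# Store failing examples in a directory that the CI pipeline caches
# between runs, so historical failures are replayed on the next run.
settings.register_profile(
    "ci", settings(database=DirectoryBasedExampleDatabase(".hypothesis/ci"))
)
settings.load_profile("ci")
```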
Greg Kapfhammer 00:23:35 That sounds really interesting Zac. One of the things I learned about Hypothesis is that it supports something called “input shrinking”. What is this and how does it help us as developers?
Zac Hatfield-Dodds 00:23:46 Great question. So input shrinking is the process by which we find those minimal failing examples. If you’ve ever used Stack Overflow, you’ll know that people love minimal failing examples because they help us narrow in on the parts of our code or the parts of the input that actually matter without extraneous detail or complexity that we would otherwise have to discard when we started debugging. And the way we do this is we start with some particular input or test case and it’s called shrinking because the first thing we try is to remove parts of it to find the smallest example possible. And so Hypothesis has a whole set of different modifications that we could try making integers smaller, making lists shorter, trying to sort different elements. And we apply each of these in various intelligent ways until we’ve found an input such that none of our modifications can make it smaller anymore. We call that a minimal failing example ’cause it’s the smallest we can find, even if it might not be the smallest input that actually exists that causes that same failure. And then we report that to the user to start their debugging process from a simpler case.
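[Editor’s note: the same shrinker is exposed directly through `hypothesis.find`, which searches for any input satisfying a predicate and then shrinks it. This is a handy way to watch minimization in action; the predicate here is an arbitrary example.]

```python
from hypothesis import find, strategies as st

# find() generates inputs until the predicate holds, then shrinks the
# satisfying input: shorter lists first, then smaller element values.
minimal = find(st.lists(st.integers()), lambda xs: sum(xs) >= 10)
print(minimal)  # a one-element list whose sum still satisfies the predicate
```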
Greg Kapfhammer 00:24:54 I wanna pick up on minimal failing examples again later in our show. But before I do that, I think we should dive into some of the advanced features that Hypothesis provides. For example, I know that Hypothesis has a feature that is called Hypothesis Write and it is sometimes called the ghost writing feature. What is this feature and how would I use it, Zac?
Zac Hatfield-Dodds 00:25:17 This is one of my favorites, especially for an interview like this, because I wrote the Ghostwriter after being frustrated at a Python conference a couple of years ago. Hypothesis was just starting to get popular, and I was really excited when I’d bump into people in the hallway and they’d say, “oh, Hypothesis!” because I had it on my badge, “that seems really cool. I’ve heard about that”. And I would get really excited and I’d say, “that’s awesome, what do you use it for?” And they’d tell me, “oh, I’m not sure how to apply it to my code. So I don’t use it, I just think it’s really cool”. But like many software engineers, I just believe deep down that some social problems have technical solutions, and so I was determined to try to write some code that would help people get started with Hypothesis. And so that’s where the Ghostwriter comes from. What it does is, it’s a command line tool (there’s also a Python API if you wanna build something on top of it) where you point it at some piece of Python that you’ve written, a module or a library or a script, and it will load that and inspect all of the functions and try to write the best property-based test that it can for each of those functions.
Greg Kapfhammer 00:26:22 So when I use this ghostwriting feature, it sounds like it’s actually writing the complete Pytest test case for me. Is that right?
Zac Hatfield-Dodds 00:26:30 That’s right. You can point it at a module. If you say, for example, hypothesis write numpy, it will go and load a few hundred functions from the NumPy library, inspect each of them, and write the source code for a test for every single one of them.
Greg Kapfhammer 00:26:44 Alright, let’s pause for a moment. Can you tell us a little bit about how Hypothesis actually does this automatic creation of the test code and the test inputs? What’s happening here Zac?
Zac Hatfield-Dodds 00:26:56 Well, I did this in 2019, so if you’re thinking it must be AI, this is the old kind of AI, the kind that involves if statements, and dictionaries, type inference, and lookup tables, and substituting in string variables for things. In some sense it’s a bit of a hack job. But what we do is, we will be pointed at some module and we’ll load that module, and for each function we’ll kind of look at it and we’ll say, does the name of this function suggest that there might be a round trip with another function in the same module? For example, if your function is named save, we’ll also look in that module for a function named load. If it doesn’t match any of these patterns, we’re just gonna try calling that and checking whether we can get an error to occur.
Zac Hatfield-Dodds 00:27:40 And then the second thing we do is we look at all the arguments to that function or pair of functions, and if they have type annotations, you know, x: int in Python, well, Hypothesis knows how to generate integers. We have this strategy which will generate instances of a type. And so we’ll create that strategy and then we’ll look at its representation, which kind of tells you how it was defined, and we can just substitute that into the source code. And then in some cases we don’t have type annotations, but we do have variable names. So we also have a whole bunch of custom logic based on regular expressions and big lists and a statistical analysis I did of a few hundred million Python source files last year, which will say, well, if you have a parameter called alpha, this is probably a floating point number and it’s likely to be intended to be in a small range.
Zac Hatfield-Dodds 00:28:32 Or if you have a variable called name, this is almost certainly meant to be a string. And so in many cases, even without type annotations, we can still guess what the arguments should be. And in some cases where we don’t have any other information, we kind of follow the Zen of Python, which says “in the face of ambiguity, refuse the temptation to guess”. And so in those cases we’ll actually leave a placeholder with an explicit to-do comment explaining to the user what kind of action they need to take to complete this test source code that we’ve just written for them.
Greg Kapfhammer 00:29:03 The Ghostwriter tool is going to be able to automatically generate the source code and the inputs. It makes me think a little bit about other automated test data generation tools that I’ve used in the past. For example, the Pynguin tool uses genetic algorithms to evolve test cases and inputs, or you could use a tool like GitHub Copilot and it would also automatically create test cases. How does the Ghostwriter in Hypothesis compare to those tools or to other tools?
Zac Hatfield-Dodds 00:29:34 Great question. So I think it’s a very different kind of tool to Pynguin. The Pynguin tool tries to generate a minimal test suite which gets high coverage. So it will take the functions from your source code and try executing them on different values. But the assertions that Pynguin ends up making are just that your source code has its current behavior. So it’s not really testing the invariants, and it doesn’t really think at all about the semantics of your code. And so it’s impossible in principle for Pynguin to tell you that you have a bug, because the test suite that it produces will actually just assert that your source code does what your code currently does. That can be useful if you need a regression test suite, or if you want to check that a re-implementation or a refactoring is still equivalent to your previous version, but I think it’s less useful for getting people started than Hypothesis. The variable names and so on are also just completely meaningless in this approach. And relative to GitHub Copilot, I think the Ghostwriter is somewhat better at understanding semantics, and because it uses Hypothesis it’s much more likely to find those rare edge cases that humans wouldn’t think of or might not have been in the training data for tools like Copilot.
Greg Kapfhammer 00:30:49 It sounds like each of those tools may have a place in our tool belt, but let’s continue to focus on how Hypothesis can specifically help us. I know another thing that Hypothesis can do is something that’s called fuzz testing. Now if listeners want to learn more about fuzz testing, I would encourage them to check out Software Engineering Radio Episode 474. With that said, Zac, can you give us some insights into what is fuzz testing, and then what are the inputs, outputs, and behaviors of fuzz testing mode in Hypothesis?
Zac Hatfield-Dodds 00:31:22 Absolutely. So fuzz testing is the grandfather of randomized testing techniques. The first paper on it was published all the way back in the 1980s, where the authors tried to test Unix utilities, the kind of coreutils built into all of our shells, by piping in random data, just truly random data. And what they found is that a shocking number of them would crash when exposed to this random data. More recently people have found that using evolutionary approaches helps you find really rare inputs, for example corrupted JPEGs for image parser libraries, and fuzzing has thus become a kind of critical security research tool. The interesting insight here is that that same evolutionary approach can be used as a backend for property-based testing. So Hypothesis, by default, will generate the inputs that we try using a kind of random number generator and a big stack of heuristics. And in the case where we have a unit testing style workflow where you want each test to run very quickly, it turns out that using fancier techniques like feedback or evolutionary search just doesn’t pay for itself quickly enough to improve performance in CI or in local tests. But if you want to run your test suite overnight or over a weekend, then we’re actually running for long enough that these techniques pay off.
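The “pipe in truly random data” approach Zac describes can be sketched in a few lines of plain Python. The `parse_record` target here is a hypothetical toy parser invented for illustration, not any real utility:

```python
import random

def parse_record(data: bytes) -> dict:
    # Hypothetical toy parser: expects ASCII "key=value" records.
    text = data.decode("ascii")    # crashes on any non-ASCII byte
    key, value = text.split("=")   # crashes unless there is exactly one "="
    return {key: value}

def fuzz(target, iterations=5000, seed=0):
    """Feed purely random bytes to the target and record every crash."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(1, 16)))
        try:
            target(data)
        except Exception as exc:   # any unexpected exception is a finding
            crashes.append((data, type(exc).__name__))
    return crashes

crashes = fuzz(parse_record)
print(f"{len(crashes)} crashing inputs out of 5000")
```

Even this naive, black-box loop finds crashes immediately, which is essentially what the original Unix experiments observed; the feedback-guided techniques discussed next are for the much rarer bugs that random bytes alone almost never reach.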
Greg Kapfhammer 00:32:41 So it sounds like what you’re saying is that I would use Fuzz testing mode in a different way or in a different time than I would use the standard Hypothesis property-based testing that we talked about before. Can you explore that in a little bit more detail?
Zac Hatfield-Dodds 00:32:57 That’s right. So I think of this as being a different part of a testing workflow. Most people are familiar with a workflow where you run tests locally, maybe the whole test suite, maybe just one or two tests for a bug you’re currently investigating, and you run your tests in CI, perhaps for every pull request and often on a regular basis for nightly tests or hourly tests to make sure that the current state of the repository still works. Fuzz testing with Hypothesis or HypoFuzz, which is our custom fuzzer, is a complement to that. We say we don’t want to run an hours-long fuzzing campaign every single time we change our code, but maybe, for example, overnight every day we’ll spend eight or 12 hours of compute time searching for new bugs. And when we run for a long time, we find inputs which Hypothesis could in principle have found, but was much less likely to find. We actually share that by putting it in the database. So when you rerun your tests, the test just immediately finds that failing input, replays it from the database, and fails with this very, very rare counterexample.
Greg Kapfhammer 00:34:05 Aha. This means that when you run Hypothesis in fuzz testing mode overnight, it will ultimately give me some feedback the next time I run Hypothesis on my laptop as a developer. That sounds like a useful feature. Can you share an example of where that kind of synergy has actually found a defect in a real-world program?
Zac Hatfield-Dodds 00:34:25 I’ve found pretty regularly, as I’ve been developing HypoFuzz, this Hypothesis-based fuzzer, that using it to run existing test suites, things I’d already had running in CI and locally for months or even years, would find inputs which I just had never found through the black-box style of testing. It turns out that the feedback guidance makes it very, very effective at finding bugs where multiple conditions have to be met for the bug to trigger, because using this coverage feedback will notice when each of those conditions are met independently and then try variations on the input that meet each of those conditions.
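The multiple-conditions point can be made concrete with a minimal, stdlib-only sketch. Everything here is invented for illustration: the toy target returns how many nested branches ran, standing in for the coverage instrumentation a real fuzzer like HypoFuzz or AFL would use, and the fuzzer keeps any input that reaches new behavior as a seed for further mutation:

```python
import random

def parse_header(data: bytes) -> int:
    """Toy target: crashes only when three byte conditions hold in order.

    Returns how many nested branches executed; a real fuzzer would get
    this signal from coverage instrumentation instead of a return value.
    """
    depth = 0
    if len(data) >= 1 and data[0] == ord("B"):
        depth = 1
        if len(data) >= 2 and data[1] == ord("U"):
            depth = 2
            if len(data) >= 3 and data[2] == ord("G"):
                raise ValueError("parser crash")
    return depth

def mutate(data, rng):
    # Flip one random byte to a random value.
    out = bytearray(data)
    out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out)

def fuzz(iterations=200_000, seed=0):
    rng = random.Random(seed)
    corpus, best = [b"\x00\x00\x00"], 0
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        try:
            depth = parse_header(candidate)
        except ValueError:
            return candidate          # found the three-condition crash
        if depth > best:              # new behavior discovered: keep as a seed
            best = depth
            corpus.append(candidate)
    return None

crash = fuzz()
print("crashing input:", crash)
```

Purely random three-byte inputs would hit the crash about once in 16 million tries, but because each satisfied condition is kept and mutated further, the feedback-guided loop typically solves the conditions one at a time in a few thousand iterations.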
Greg Kapfhammer 00:35:01 I know that there are a number of other fuzzing tools, some of them aren’t customized for the Python programming language but are rather for binaries that may be implemented in say C or other programming languages, can you compare Hypothesis and its Fuzzing mode to things like the American Fuzzy Lop or AFL tool?
Zac Hatfield-Dodds 00:35:21 AFL is one of the best known fuzzers around, the other being libFuzzer, based on LLVM. So these tools are usually designed for use on native code, so written in C or C++ or other languages that compile with LLVM, and they usually assume that the person running the fuzzer is a security researcher rather than a developer of the program. So they typically take a complete program and have the equivalent of like one integration test, which will take the kind of input that that program would take, just as a byte string or as the contents of a file, and try variations on that. So there are a couple of differences in the implementation as well, because HypoFuzz has two advantages. The first is that because we’re integrated tightly with Hypothesis, while we don’t have these advantages on C code, we do understand the structure of the inputs to a much greater degree.
Zac Hatfield-Dodds 00:36:17 So we can do much more targeted mutations or modifications of those test inputs in a way that makes us much more efficient. And the other is that because we’re designed for test suites, instead of having one fuzz harness for a single target, we could say: suppose we have a thousand test functions, we don’t need to run each of them on a separate CPU core. We’d actually run each of them for a few seconds at a time and measure how much progress we’re making, how quickly we’re discovering new behavior, and kind of solve a dynamic optimization problem to maximize the overall rate of bug discovery.
Greg Kapfhammer 00:36:55 Before we move on to the next topic, maybe we can draw this all together. Overall, what are the benefits and limitations associated with using Fuzz testing in Hypothesis?
Zac Hatfield-Dodds 00:37:07 I think the benefits are that you find bugs, which you would be very unlikely to find otherwise. The costs or the limitations are that this doesn’t guarantee you’ll find bugs. You can still only find bugs if there is an input that would make your test fail and reveal a bug that way. And using this additional tool just takes a bit of additional work, which isn’t always worth it. But I think if you are running Hypothesis as a team and your team has access to a server, you should really try it out.
Greg Kapfhammer 00:37:37 Thanks for that response. I want to turn our attention now to another feature that Hypothesis has, it’s called Explain mode. And I think that there’s an interaction between the minimal failing input that you mentioned before and explain mode. So let’s dive in. What is Explain mode and how does it work?
Zac Hatfield-Dodds 00:37:57 This one’s fun. So as background, I was actually doing my PhD on Hypothesis and extensions to Hypothesis. So the Fuzzer was one and Explain mode is another. The reason I started trying to work on Explain mode was that a minimal failing example is not always maximally helpful. For example, we mentioned a zero division earlier, if we have this division test Hypothesis will report to us that the minimal failing example is zero divided by zero and that raises a zero division error. And the problem is if you don’t already know what a zero division error is and why it happens, you’re looking at this and you go like, well which parts of this input actually matter? I know it’s minimal, but is the problem that both numbers are zero, that the numerator is zero, that the denominator is zero or something else? And so Explain mode,
Zac Hatfield-Dodds 00:38:46 One of the things it will do is after we shrink to a minimal failing example, we’ll try variations on parts of that example and then we’ll report the parts that we could vary without changing the result of the test. So in this case, we’ll say, given this example zero divided by zero, we could try any value for the first value part of this. So zero or anything else divided by zero is a zero division error. And when you generalize this to more complex inputs, it turns out that in my experiments, maybe like 20 to 30% of the time, this takes a report from something that I have to investigate a little to understand what’s going on to something where I just look at it and I’m like, oh, I know exactly what’s happening here. It’s obvious now. And the other part of Explain mode is we run under coverage and so we can report to you which lines of code were executed by every failing example, but no passing examples. And so knowing this kind of divergence can be pretty useful in cases where trace backs would usually be preferred, but we don’t have them for some reason. Often that happens when our test says calculate the value we got, calculate the value we expected, assert the rate they’re equal, the trace back will point to that assertion, but this explain mode can point to the internals of the code where things actually changed.
Greg Kapfhammer 00:39:59 That’s really interesting. You mentioned a moment ago the idea of coverage. What is coverage and can you explore further how that’s useful in Explain mode?
Zac Hatfield-Dodds 00:40:10 Absolutely. So coverage is the idea that we can observe which lines of code were executed by our tests. If you hook it into Hypothesis, we can actually observe for each test case, each input that we try, what lines of code were executed by that input or by your code when processing that input. And so knowing that for each of the passing and failing inputs can help us narrow down which parts of your code seem to be related to the reason your test failed.
Greg Kapfhammer 00:40:38 Overall Hypothesis has a lot of really useful features and you’re one of the main maintainers and developers of Hypothesis. My next question is as follows, when you’re building a new feature like Explain mode or Fuzz testing mode, how do you evaluate that and find out whether it is effective and useful to developers?
Zac Hatfield-Dodds 00:41:01 That’s a great question and it’s actually harder than it sounds. Evaluation is famously difficult. So one thing that I usually do is just kind of think about when I’m looking at the questions people ask on Stack Overflow or the issues that people open on our GitHub repo or when I’m talking to people at conferences, I’m always trying to listen for what are the confusions that people have, what are the challenges that people face when they’re testing? And then I try to think, is there something I could build that would help people with this? And so I’m really passionate about clear documentation, but even more than that, helpful error messages because an error message will help someone who didn’t even read the documentation in each of these cases. Then for explain mode, I largely justify it by trying a few experiments myself. I am myself a Hypothesis user as well as the maintainer and so I can often go, alright, I hit a couple of bugs lately, let me try out this new variation and see if that would’ve helped me in those cases. Or do I think that this would have addressed the complaint that someone told me over a beer at a conference last year? You can also do more formal experiments, but usually those take too long to be useful while you are developing the tool in the first place.
Greg Kapfhammer 00:42:14 You hinted at something that I find really interesting. Can you share another story about how you actually use property-based testing to develop Hypothesis?
Zac Hatfield-Dodds 00:42:25 Pretty much the same way I’d use it to develop any other tool. When I can think of some property or some invariant, I’ll often write a property-based test that says, here are the range of valid inputs. Hey Hypothesis, can you find an input that makes this code fail? For example, is there any kind of input that doesn’t save and load from the database correctly? Is there any input that breaks this particular internal helper function that I have? If I go through every module that I have installed, does the Ghostwriter work for all of them or does it crash sometimes?
Greg Kapfhammer 00:42:59 Those are good, interesting examples. And I think this leads us to the next phase of our show. I know Hypothesis is a tool that’s been around for a number of years and that there are numerous add-on packages and then additional tools that leverage the Hypothesis tool. So I’d like to talk about a few of those now. One of them that I found really interesting is called Hypothesis JSON Schema. Now in order to start, Zac, can you tell us what is a JSON schema and then more importantly, how does Hypothesis use JSON schema to support property-based testing?
Zac Hatfield-Dodds 00:43:36 Great questions. So a JSON schema is a draft internet standard that describes, as a JSON document, what kinds of JSON documents are considered to match. The reason this is interesting is it’s the basis for many standards for web API schemas. So if you develop a website or an API that has an OpenAPI schema, which used to be called a Swagger schema, or if you use the GraphQL schemas that Facebook invented, in each of these cases, being able to generate JSON objects, you know, in Python that’s dictionaries and lists and so on, that match some schema can help us generate input data to test web APIs.
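A toy version of schema-driven generation can be written with just the stdlib. This sketch handles only a tiny subset of JSON Schema (flat objects, bounded integers, short strings) and is not how the hypothesis-jsonschema library is implemented; the `user_schema` example is invented:

```python
import random

rng = random.Random(0)

def from_schema(schema):
    """Generate one value matching a tiny subset of JSON Schema."""
    kind = schema.get("type")
    if kind == "integer":
        lo = schema.get("minimum", -100)
        hi = schema.get("maximum", 100)
        return rng.randint(lo, hi)
    if kind == "string":
        length = rng.randrange(0, 8)
        return "".join(rng.choice("abcxyz") for _ in range(length))
    if kind == "object":
        # Generate every declared property recursively.
        return {key: from_schema(sub)
                for key, sub in schema.get("properties", {}).items()}
    raise NotImplementedError(kind)

user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 120},
    },
}
example = from_schema(user_schema)
print(example)
```

The real library supports the full draft standard, including arrays, nested objects, pattern-constrained strings, and schema combinators, which is exactly why it is useful as a building block for API testing.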
Greg Kapfhammer 00:44:16 This sounds like it would be really useful if I have some type of complex object that goes into my function under test and then I want to describe it using a JSON schema. Is that the way that this tool is normally used?
Zac Hatfield-Dodds 00:44:29 I do know people who are using it that way, but I think the most common use is actually via a tool that a friend and I collaborated on called Schemathesis. And Schemathesis is a Python library, but also a command line tool that can test any web API that has a schema, regardless of the language that it’s implemented in. Because those web schemas will also say what the endpoints are, they’ll say what kind of data should be valid for each endpoint, and also something about what each input should return. And so we can generate test data using Hypothesis and this JSON schema extension to actually see if there are any inputs which make your web service crash. We can also test security properties. For example, if you make a request as Alice, which requires authentication, and then you replay it while authenticated as Bob, we can test that Bob should not actually get access to Alice’s private data.
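The replay property Zac mentions can be shown with a small, self-contained sketch. `ToyAPI` is an invented in-memory stand-in for a real web service, and the token format is hypothetical; Schemathesis performs the equivalent check over real HTTP requests:

```python
class ToyAPI:
    """Hypothetical in-memory service with per-user private notes."""

    def __init__(self):
        self._notes = {"alice": "alice-secret", "bob": "bob-secret"}

    def get_note(self, auth_token: str, path: str) -> str:
        # Each token authenticates as the user it names.
        user = auth_token.removeprefix("token-")
        owner = path.removeprefix("/notes/")
        if user != owner:
            raise PermissionError("forbidden")
        return self._notes[owner]

api = ToyAPI()

# Record a request made as Alice, then replay it with Bob's credentials.
alice_request = {"auth_token": "token-alice", "path": "/notes/alice"}
assert api.get_note(**alice_request) == "alice-secret"

replayed = dict(alice_request, auth_token="token-bob")
try:
    leaked = api.get_note(**replayed)
    assert leaked != "alice-secret", "Bob read Alice's private data!"
except PermissionError:
    pass    # correct behavior: the replayed request was rejected
```

The property holds regardless of which endpoint or which pair of users is chosen, which is what makes it a good candidate for schema-driven generation rather than a handful of hand-written examples.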
Greg Kapfhammer 00:45:21 Wow, all of that sounds really useful. So I can now see how using Hypothesis JSON schema and the Schemathesis tool would help me to test a web API. I want to turn our attention to another kind of testing that’s supported by Hypothesis. There’s a tool called Hypothesmith, if I’m pronouncing that correctly, and I think it’s inspired by the CSmith library. So let’s begin. What is CSmith, and then what is the Python and Hypothesis version of CSmith?
Zac Hatfield-Dodds 00:45:53 CSmith is a famous tool developed for C compiler testing. The idea there is that if you can generate random valid programs, that is, the source code for a C program or, with Hypothesmith, a Python program, you can then compile that program with different compilers, and if your compiled program gives you different results from different compilers, at least one of them must have a bug. So this has turned out to be enormously effective and has improved GCC and LLVM and many other C compilers. And so Hypothesmith was my attempt to build a similar tool based on Hypothesis to generate Python source code. It’s basically a proof of concept, it’s just enough there to demonstrate that the idea is sound, but it’s already been used, and variations on it have been used by the Python core developers when they’re implementing new features or performance optimizations to make sure that they work in all cases and don’t leak state or cause problems.
Greg Kapfhammer 00:46:49 That sounds really useful, Zac. What I think you’re saying is that Hypothesis can be used by Hypothesmith to generate Python source code that actually tests the CPython implementation. Can you give us a concrete example of where there was a new optimization in CPython and how Hypothesmith was used to test it?
Zac Hatfield-Dodds 00:47:14 Absolutely. I can even do one better than that. A few years ago, in Python 3.9, the CPython developers changed the parser that they were using, so it’s now what’s called a PEG parser. And as part of that they did a lot of testing, but they were not at that time using Hypothesis. They have since added Hypothesis to their own CI system. And so when I was testing it post-release, I actually managed to find, using this tool, an input that would make CPython itself segfault on syntactically valid source code. It was a particular thing to do with backtracking around backslashes.
Greg Kapfhammer 00:47:48 Wow. So if I’m understanding you correctly, you were able to use Hypothesis and a CSmith-inspired tool in order to find a real bug in the production version of the CPython implementation?
Zac Hatfield-Dodds 00:48:01 That’s right. I tested it during the betas and found a bug, which they treated as a P0 release-blocking bug. They fixed that one, but then after release when I retested, I found another segfault.
Greg Kapfhammer 00:48:13 So I think that leads me really well to the next question that I wanted to discuss. It’s clear as you’ve been explaining the details about Hypothesis that some of its features have actually come from the software testing research literature and ultimately they lead to a tool that can find bugs in CPython. So Zac, can you explain to us what are some of the best practices that you’ve learned about when it comes to leveraging research and then ultimately building a prototype that finds high-importance bugs in something like CPython?
Zac Hatfield-Dodds 00:48:48 This is a great question, and trying to answer it is kind of the focus of much of my PhD actually. I think my experience is that much of the research literature is really focused on ideas about testing. So people are trying to get proofs of concept or prototypes or ideas that could be useful for software testing or finding or understanding bugs. And what I’ve found through my own work in open source is that often I have quite a different emphasis, where I look at an idea and I say something like, well, this is neat if you have a very specific kind of problem. For example, an idea might only work if people are using a subset of the language, but Hypothesis users use all of the Python language, so I can’t do that version. But is there something inspired by this, you know, which would maybe give me a weaker tool but applicable to many more cases or many more users? And so I’ve found reading the research literature really valuable, but always with an eye to how would I make this work in practice? Is there a variation that would be applicable in more cases? And ultimately, what kind of user interface or user experience would make this valuable to people who are trying to get their work done?
Greg Kapfhammer 00:49:59 I can tell that you really care about creating software testing tools that can enhance developer productivity. Now that we’ve talked a little bit about how to transition research into practice, can you share some general lessons about what we should do as developers when we’re creating software testing tools to aid ourselves and other developers?
Zac Hatfield-Dodds 00:50:21 That’s a tricky question because so often I think the answer is it depends. It depends on the context, it depends on the domain you’re working in. For example, bugs are much, much more expensive if you’re building medical devices or airplane navigation systems than if you’re building a website for a demonstration or a marketing stunt. And so I think sometimes as a discipline we overfocus on bugs and we say things like it’s really important that software be correct or there’s this kind of moral responsibility. And I do feel that we have a real responsibility, but that the responsibility we have as software engineers is not to pursue correctness in a narrow-minded way, but to think about the impact that our work has on society. And so I think using the best tools we can, thinking about the workflows that make us productive, and thinking about whether the kind of thing we’re building is actually something valuable for the world is really important. And then the specific questions about testing for me, the answers kind of fall out of that broader perspective about my work as a software engineer.
Greg Kapfhammer 00:51:28 So let’s pick a specific domain. I mentioned at the start of the episode that you lead the assurance team at Anthropic. First of all, can you tell us briefly about what kinds of software Anthropic develops and then as a follow on, are you applying property-based testing to the software developed at Anthropic?
Zac Hatfield-Dodds 00:51:47 Great questions. So Anthropic is an AI research and development company. We’re structured as a public benefit corporation, and our goal is to develop reliable, steerable, and trustworthy AI systems. Among other things, we develop advanced language models. You can go to our website, anthropic.com, and play with the latest version of Claude, which was released the day we recorded this episode. When we’re writing code, we do actually use property-based testing; wherever we have things with mathematical invariants, property-based testing is a fantastic way to get rigorous tests for those. I’ve also found that the ways of thinking about software testing that I’ve discussed can be really valuable in thinking about how we explore the behavior of advanced AI systems and make sure that they behave well in all or almost all cases.
Greg Kapfhammer 00:52:37 That makes a lot of sense. Thanks for sharing that example. As we draw our episode to a conclusion, I know we’ve now talked about example-based tests and property-based tests. We’ve also talked about Fuzz testing and other approaches to testing, and I remember you saying at the very start of our episode that all of these may be useful to us as software developers. So as we start to draw our episode to a conclusion, can you offer some practical advice about when a developer should use these various types of testing approaches?
Zac Hatfield-Dodds 00:53:09 Great question. So the way I think about it, almost every test suite should have at least some property-based tests in it. So if you currently have zero property-based tests, the usual place to start is thinking about those save and load round-trips. Where are places where you convert data between different formats, and can you test that? Another great way to get started is to look for places where you have table-based tests, for example Pytest parametrize, and think about whether you can convert that to a property-based test and get Hypothesis to generate some of those inputs for you. On the other hand, I’m certainly not an absolutist. I don’t believe that all of your tests should be property-based, and depending on the project, I’ve seen anywhere between literally a single property-based test in the whole test suite and about 80% of the test functions being property-based tests be appropriate.
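The save/load round-trip Zac recommends as a starting point can be sketched without Hypothesis, using only the stdlib: generate random JSON-serializable documents and check that serializing and deserializing gives back the same value. In a real test you would let Hypothesis’s strategies generate and shrink the inputs; this hand-rolled generator is just an illustration of the same property:

```python
import json
import random

def random_document(rng, depth=0):
    """Generate a random JSON-serializable value, Hypothesis-style."""
    choices = ["int", "str", "bool", "none"]
    if depth < 2:                      # cap nesting so documents stay small
        choices += ["list", "dict"]
    kind = rng.choice(choices)
    if kind == "int":
        return rng.randint(-10**6, 10**6)
    if kind == "str":
        return "".join(rng.choice("abc \u00e9") for _ in range(rng.randrange(5)))
    if kind == "bool":
        return rng.random() < 0.5
    if kind == "none":
        return None
    if kind == "list":
        return [random_document(rng, depth + 1) for _ in range(rng.randrange(3))]
    return {f"k{i}": random_document(rng, depth + 1)
            for i in range(rng.randrange(3))}

rng = random.Random(0)
for _ in range(500):
    doc = random_document(rng)
    # The property: loading what we saved gives back exactly what we had.
    assert json.loads(json.dumps(doc)) == doc
print("500 random documents survived the round-trip")
```

The same shape of test applies to any serialization pair, such as your own to_bytes/from_bytes or to_dict/from_dict helpers, and it is often the single highest-value property-based test a project can add.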
Greg Kapfhammer 00:54:01 Thanks for sharing that example, Zac. We’ve now talked a lot about property-based testing and example-based testing, and we now know how to use Hypothesis and tools like Pytest. Do you have a call to action for our listeners if they want to get started and explore this area in greater detail?
Zac Hatfield-Dodds 00:54:19 Absolutely. Two separate calls to action. If you’re using Python: pip install hypothesis, run the Ghostwriter, and see if you can improve on its output. I would bet you’ll look at that and go, no, that test is kind of stupid, you should be checking this, or the inputs should actually look more like that. If you’re not using Hypothesis, look up the property-based testing library for your language. Some particularly popular ones are proptest for Rust or jqwik for Java, or QuickCheck for Haskell, which is one of the most famous, and dig in. The general principles apply in any language.
Greg Kapfhammer 00:54:53 Thanks for sharing the details about those tools. We’ll have additional information in the show notes, so if you’re a Haskell programmer or a Rust programmer or a Java programmer, you can translate the concepts of property-based testing into the specific tools for your programming language. Zac, as we draw our episode to a conclusion, are there any final points that you want to share with the listeners of Software Engineering Radio?
Zac Hatfield-Dodds 00:55:19 I think we’ve covered everything. It’s been a pleasure talking to you, Greg.
Greg Kapfhammer 00:55:22 It’s been great to chat with you as well. This has really been an awesome conversation. If you are a listener and you want to learn more about property-based testing, or Hypothesis, or any of the specific tools and technologies that we’ve mentioned in this episode, make sure to check the show notes for references and additional details. For example, you might want to listen to episode 322 from 2018 because it gives another take on how to use property-based testing. With that said, let me say thank you to Zac for being on the program. It was wonderful to chat with you today.
Zac Hatfield-Dodds 00:55:57 My pleasure.
Greg Kapfhammer 00:55:59 All right. This is Gregory Kapfhammer signing off for Software Engineering Radio. Thank you, Zac, and goodbye everyone.
[End of Audio]