Goran Petrovic, a Staff Software Engineer at Google, speaks with host Gregory M. Kapfhammer about how to perform mutation testing on large software systems. They explore the design and implementation of the mutation testing infrastructure at Google, discussing the strategies for ensuring that it enhances both developer productivity and software quality. They also investigate the findings from experiments that quantify how mutation testing enables software engineers at Google to write better tests that can detect defects and increase confidence in software correctness.
Show Notes
- Major Mutation Framework: mutation testing tool for Java
- Stryker Mutator testing tool
- GitHub – boxed/mutmut: Mutation testing tool
- Blog post describing mutation testing at Google
- Early description of mutation testing at Google
Related Episodes
- SE Radio 474: Paul Butcher on Fuzz Testing
- SE Radio 609: Hyrum Wright on Software Engineering at Google
- SE Radio 324: Marc Hoffmann on Code Test Coverage Analysis and Tools
- SE Radio 317: Travis Kimmel on Measuring Software Engineering Productivity
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Gregory Kapfhammer 00:00:18 Welcome to Software Engineering Radio. I’m your host, Gregory Kapfhammer. Today’s guest is Goran Petrovic, a staff software engineer at Google. Goran focuses on software quality improvement through the development and evaluation of new software tools and processes. Welcome to the show.
Goran Petrovic 00:00:37 Thanks Greg. It’s good to be here.
Gregory Kapfhammer 00:00:40 Today during our episode, we’re going to be talking about Goran’s effort to introduce mutation testing at Google. We’re going to start by exploring what mutation testing is and the ways in which Google’s software engineers tend to use it. Now Goran, I know you published a paper called “State of Mutation Testing at Google,” and if you don’t mind, I’m going to read you a sentence from the paper and then invite you to comment on it. Does that sound cool?
Goran Petrovic 00:01:04 Sounds great.
Gregory Kapfhammer 00:01:05 Okay. So the paper was called “The State of Mutation Testing at Google,” and the sentence was “Mutation testing assesses test suite efficacy by inserting small faults into programs and measuring the ability of the test suite to detect them.” So at a high level, can you tell us what is mutation testing and how does it work?
Goran Petrovic 00:01:26 Sure. So everyone is familiar with the standard criteria like line coverage, and it’s been a standard in the industry for many years. And mutation testing is just a step further in this, to ensure that the code is not only covered with tests but actually properly tested in terms of assertions. You could imagine that you can have a function that you call from a test and don’t assert anything. It can have a hundred percent coverage, but it’ll never catch any real bugs. And mutation testing comes to help in terms of actually validating that these tests do something useful. The big picture is: why do we even write tests? Maybe 20 years ago or 10 years ago, people were complaining that they had to do it and they said, we don’t need that. Now it’s more or less a standard, but the real question is why do you do it?
Goran Petrovic 00:02:15 And the reason is we don’t want bugs to be introduced into the code and then pushed to our users. So the best way to do that is to actually have tests that can catch these bugs, not only cover certain lines of code. So mutation testing is a system where you can just change the code, change the implementation by mimicking a bug in a way. Let’s say replace A plus B with A minus B, and then run all the tests. If no tests fail, there’s something wrong there. So you might have coverage, but this is not a good test. And mutation testing is a way of generating many of these different code versions with slight changes, they’re called mutants, and then running all the tests on all of them and calculating the mutation testing score, which tells you how many of these mutants your tests were able to detect. And then that’s another number, along with coverage, that can help you understand the quality of your tests even better.
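To make that idea concrete, here is a minimal illustrative sketch in Python. This is hypothetical code, not Google’s tooling: a sum function, a hand-written mutant of it, a test with full coverage that cannot kill the mutant, and one that can.

```python
# Hypothetical sketch of mutation testing's core idea.

def add(a, b):
    return a + b        # original implementation


def add_mutant(a, b):
    return a - b        # mutant: "+" replaced with "-"


def test_weak():
    # Calls add(), so line coverage is 100%, but there is no assertion:
    # run against the mutant, this test still passes, so the mutant survives.
    add(2, 2)


def test_strong():
    # Asserts on the result: against the mutant this fails (2 - 2 == 0),
    # so the mutant is killed and counts toward the mutation score.
    assert add(2, 2) == 4
```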
Gregory Kapfhammer 00:03:09 So what you’re explaining is interesting, and yet it may sound somewhat counterintuitive because normally we associate testing with the process of getting bugs out of our code. But you’ve explained that mutation testing is putting bugs into our code. Can you unpack that a little bit further? Why would we want to put a bug into our code?
Goran Petrovic 00:03:29 Sure, I do that every day. Programming is an art of adding bugs to your code and then removing them, and probably you never remove all of them. There are some remaining. So you could say that 99.99% of the code that I write is full of bugs, and eventually it’ll have so few bugs that they won’t be so important, so I can release my code. And that’s a good reason to try to insert bugs into the code, because it’s inevitable that we will do it. As human beings, we are not evolved to write complicated code. We are evolved to run around in the woods. So it is very hard to hold concentration, and the bugs will definitely happen. So mutation testing is just a natural way of simulating that. You just say, imagine if you made a bug here, imagine if you inserted the bug here, what would happen?
Goran Petrovic 00:04:15 And people sometimes ask, but I didn’t, so my code is correct, why should I care? But software engineering is programming integrated over time. So usually if you just write code for a one-off script or a competition, that’s okay. But most of the code that companies write will live for years and years, decades even. And you as an author are just the first author. There will be countless changes happening to the code, some from your team, from the people who are familiar with the code, but some from completely different people who want to add a feature or want to do a refactoring. And even I don’t understand most of the things that I’m writing, let alone if I’m editing someone else’s code. So the chance of me introducing a bug is huge, and that’s why we want to protect against that.
Gregory Kapfhammer 00:05:01 So if I’m understanding you correctly, when I put these small faults or mutants in the program and my test suite detects them, then I know that if such a bug is later introduced into my program, the test suite would still detect it. Is that the right idea?
Goran Petrovic 00:05:15 It is. And of course a student might ask, I’m not going to write A minus B instead of A plus B in a sum function. But the idea is that these small changes, these mutants, they might not look like real bugs, but they are correlated with real bugs. So this is the core assumption: if that doesn’t hold, then mutation testing doesn’t make any sense, but it happens to hold. So that’s the reason why, if your tests can detect these small, non-bug-looking changes, they can also prevent real bugs.
Gregory Kapfhammer 00:05:48 You made a really good point about the correlation between these mutants and real bugs and we’re going to talk about your experience at Google later in the episode specifically in that area. Before we move on, I wanted to pick up on something you said a moment ago. You were talking about like A plus B and then you mentioned A minus B if I remember correctly. Is that what you mean when you talk about the concept of a mutant?
Goran Petrovic 00:06:13 Yes. That’s one example of it. There have been many, many different mutation operators developed over the years, and a mutation operator is just a fancy way of saying how do I change your code to insert a bug into it? So when you encounter any binary operator like a plus or a minus, you can replace it with a division or a subtraction or any other operator that would fit in that code. So that’s an example of a mutant, yes. Or if you have if A equals B, you can just say if A does not equal B; you can negate the operator.
Gregory Kapfhammer 00:06:42 So you’ve given an example of not equals and equals and plus and minus. Can you give one or two more concrete examples of mutants?
Goran Petrovic 00:06:51 Sure. The most boring mutant is deleting statements, let’s say delete a line, but because of the structure of the code they are the most prevalent, because you cannot replace a plus with a minus in every line of code; it might not even have a plus. You cannot negate a condition because not every line contains a condition, but almost every line can be deleted. So while it is boring, it is the most common mutant: just delete a statement, delete a line, delete a block. Let’s say you have an if statement and you just delete the whole else block. That’s one of the mutants that you could do. Others are just changing numbers from zero to one, changing strings from whatever the string literal is to an empty string. You can also take the absolute value of a number and then negate it. This is one of the mutants that we don’t do in Google because it turned out to be the least useful of all mutant classes. That’s an interesting consequence of doing this research. But yeah, whatever type of change you can imagine, there have been mutation operators proposed and implemented for that. Those are very simple ones that I’ve been describing, but there are more complicated mutants. You could cast a pointer into a base class or do something strange with object-oriented mutations, or for each language, for example Lisp, you can do multiple other different things and so on.
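For illustration, here is a small hypothetical catalogue of the kinds of operators just described, sketched in Python; the names and examples are invented for this example and are not Google’s operator set.

```python
# Hypothetical catalogue of simple mutation operators, as (name, before, after) examples.
OPERATORS = [
    ("statement deletion",        "total += item.price",  ""),
    ("relational op replacement", "if a == b:",           "if a != b:"),
    ("arithmetic op replacement", "return a + b",         "return a - b"),
    ("constant replacement",      "retries = 3",          "retries = 0"),
    ("string literal clearing",   'name = "default"',     'name = ""'),
    ("absolute value negation",   "return x",             "return -abs(x)"),  # the class described as least useful
]

for name, before, after in OPERATORS:
    print(f"{name:28s} {before!r:28s} -> {after!r}")
```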
Gregory Kapfhammer 00:08:08 One of the things that I heard you say a moment ago is that mutation testing is this strong test adequacy criterion. And if I remember correctly, I think you said it was stronger than code coverage criteria like statement coverage or branch coverage. Can you tell us a little bit more what is a test adequacy criterion and then why is mutation testing quote unquote stronger than these code coverage criteria?
Goran Petrovic 00:08:32 Sure, Greg. So it’s very simple. You can define a criterion, let’s say coverage of lines, and just count the lines that are covered over all lines and you get a percentage, and this is your criterion; you can say a test suite is adequate according to this criterion if it covers a hundred percent of lines and so on. For mutants it’s a bit different, because you generate all the mutants that you can in the code and then run all the test suites on all of them, and you calculate the percentage of the killed mutants, and this is your mutation score or mutation adequacy score. And if it’s a hundred percent, which I’ve never seen, it means that all the mutants in the code have been killed. The issue with this is that while coverage is very well defined, each line is either covered or not, and branch coverage as well, each branch is taken or not, mutation testing is defined in the sense that either a mutant is killed or not, but it is not well defined in the domain of mutants. I have yet to see two different mutation testing tools that can generate exactly the same mutants on the same piece of code. There are different optimizations, there are different underlying technologies used that will yield different results, and that’s why you cannot compare these numbers across tools or programming languages or even your own invocations of the same tool on the same code.
Gregory Kapfhammer 00:09:48 I noticed you mentioned the idea of a mutant being killed. Can you explain what that concept is all about?
Goran Petrovic 00:09:54 Sure. So when you have mutated code, let’s say that the function sum that I talked about before, that is just returning A plus B, is suddenly changed into A minus B, that constitutes a mutant. Hopefully there will be some kind of a unit test that will fail when you do this. If you assert that two plus two is four, suddenly this mutant will return a zero, and this is not good. Hopefully there is a test that tests this, and if the test fails, if any test covering this line fails, we say that this mutant is killed by that unit test. And the mutants that are not killed are called living or alive mutants, and they are more interesting than the killed ones because they are the ones where you can actually add stuff to your tests to make them better.
Gregory Kapfhammer 00:10:35 Alright, that makes a lot of sense. I have two more quick questions before we go to the next part of our show. I noticed that you’ve been mentioning a process for mutation testing. What I’d like you to explore a little bit further is whether this process is manual or automated. For example, does the software engineer actually insert these mutants themselves or is there tool support for this process?
Goran Petrovic 00:11:00 Sure, there is widespread tool support and it’s all completely automated. As a developer, you don’t have to do anything, just wait until the system reports the living mutants in your code and then you can react on them.
Gregory Kapfhammer 00:11:12 This sounds like really an exciting thing, and I can imagine why software engineers at Google would be interested in using it. However, with that said, I remember that in one of the articles that you wrote, you said in 2016 that mutation testing was only used as a part of 93 code reviews at Google and then it grew to roughly 2,500 users in 2017. And by 2021 there were tens of thousands of engineers at Google who were using mutation testing. So now that we understand the basics of what mutation testing is, my question is as follows, why was it initially hard to convince engineers to use mutation testing?
Goran Petrovic 00:11:55 Because they’re smart people. By design, mutation testing generates a huge amount of mutants and then shoves them down your throat. When I started implementing this, what I did is I read a bunch of research papers, I was really excited, I started coding, and in a few weeks I had something that worked, originally for C++, and then I started generating these mutants. The basic system was just reporting those mutants that were live in the command line, and you could just look at them, or you could save them into a file and go one by one. And my first run, I remember, took a few hours to run. I ran it on a small code base of my team. I was back then in Google Shopping. When it finished, I looked at the file and could barely open it in Vim; it was huge.
Goran Petrovic 00:12:37 There were thousands, tens of thousands of mutants, and after three pages and two coffees, I was so tired that I just closed the file and forgot about it for a few weeks, because there was a huge amount of mutants and most of them were not things that I would actually be adding tests for, that I would want to have tested against. We call these unproductive mutants. But that was the first stage. And then of course, because this was my 20% project, something that I would do on Fridays only, I forgot about it for a few weeks. And then later when I continued, I realized this doesn’t work. Even I, who am excited about mutation testing, cannot convince myself to look at this for more than 10 minutes. So I just decided at that point that we have to prune this somehow.
Goran Petrovic 00:13:20 And the first step was just to figure out what these unproductive mutants were. Let’s say you have a log line, and you delete the log line. Of course the test doesn’t fail, because it doesn’t check that, and you would be surprised, but there are thousands of categories of mutants like this that are not interesting. Some of them might look good at the beginning, but then you quickly realize this is not something that the developers would want to write a test for. So slowly but surely, I started encoding these heuristics as a filter, so I would not consider these lines of code for mutation at all. The log line is the simplest one, you can do it in 10 minutes, but as I said, there were thousands, so it took a long time. But once I was happy with the number of mutants in the output, I started getting my team to use it.
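A tiny sketch of that filtering idea, in hypothetical Python. The real system encodes thousands of per-language heuristics at the AST level, not regexes over source lines; this only shows the shape of the "skip unproductive lines" step.

```python
import re

# Hypothetical pattern for "arid" lines whose only effect is logging.
LOG_CALL = re.compile(r"\b(log(ging|ger)?|LOG)\s*[.(]")

def is_unproductive(line: str) -> bool:
    """True for lines we should never mutate, e.g. pure log statements."""
    return bool(LOG_CALL.search(line))

def lines_to_mutate(covered: dict[int, str]) -> list[int]:
    """Keep only covered lines that survive the unproductive-line filter."""
    return [n for n, text in covered.items() if not is_unproductive(text)]

# Example: only line 2 would be considered for mutation.
print(lines_to_mutate({1: 'logging.info("saved %s", key)', 2: "count += 1"}))
```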
Goran Petrovic 00:14:05 That’s the 93 pull requests in Google that you see. It’s all my friends who did it, because I asked them to. They probably didn’t like the tool because the quality still wasn’t high; the percentage of useful mutants was like 12% I think at that point, not very useful. But with their feedback they would be labeling my set for me. They would just say, this mutant is useful, this isn’t. And then over time I could encode more and more of these heuristics. That’s how we got to a much better number. I think now it’s around 90% useful and 10% not useful mutants. That’s why it took a long time. And that was the first big improvement, huge amounts of these heuristics. And then the second was I realized that when you have these automatically generated tests, probably using some kind of a genetic algorithm that is trying to fit some criterion, let’s say killing mutants, the tests that they generate will probably kill just one mutant in that line, because that’s what they do.
Goran Petrovic 00:15:00 They evolve to do it. But humans are sufficiently evolved to not write such tests. So you wouldn’t test half of a line or a third of the line or even some specific sub-condition. We try to write good tests that are useful, and they usually kill all the mutants in that line. So I realized that if I wanted to run this for more than just my team, it wouldn’t scale. I calculated once, back of the envelope, that it would take 7 billion years with all the computing power currently available to all humans to do one iteration over our 2 billion lines of code to calculate a mutation score. So obviously the universe didn’t like what I was doing, but I wasn’t discouraged by that. I just decided I’m not going to generate more than one mutant in a line, and I’m going to pick that one at random. And that was the second hurdle. Once we got past that, we could start opening mutation testing to other teams, because it wasn’t as annoying and it wasn’t as useless.
Gregory Kapfhammer 00:15:55 Thank you. Those were some incredible insights, and I can absolutely see why you had to limit the number of mutants that you would actually present to a software engineer. I noticed a moment ago you talked about mutants that are not useful. Like for example, mutating a logging statement. Can you give us an example of a mutant that you would consider useful and then explain why it’s useful?
Goran Petrovic 00:16:18 Sure. Let’s say that in this simple summation example, A plus B, most mutants are useful. If you replace that with return zero, then that’s a useful mutant. It means that you’re not really testing anything. Another example would be a function that determines whether a user is of age or not, something for your website. So you would have something like return user age greater or equal to 18, depending on the country. And the mutant that replaces this with true or false is useful, but coverage can already tell you that you probably did not just return false in that function, because then your user would always be underage; you probably have something there. So those are less useful: they can still indicate coverage problems, but coverage is much cheaper to calculate, and then you should be using that instead. But if you change that to greater than instead of greater or equal to 18, then it might catch a corner case.
Goran Petrovic 00:17:11 If your implementation doesn’t handle that a user of exactly 18 is already of age. So that’s a very useful mutant there. And if it survives, it’ll just detect a bug immediately in your code. Hopefully it won’t, because you implemented it perfectly the first time, but these are just small examples. There are billions of lines of code, and any kind of edge case where you do greater than or equal versus greater than is very useful, especially with iterators and inclusive-exclusive intervals and so on. These are really tricky to get right, and mutation testing helps there a lot.
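Here is a small hypothetical Python illustration of that boundary case; the function and test are invented for this example. The test pins down behavior at exactly 18, so the "greater-or-equal to greater-than" mutant cannot survive it.

```python
LEGAL_AGE = 18

def is_of_age(age: int) -> bool:
    # A mutant here would change ">=" to ">", breaking the age-18 case.
    return age >= LEGAL_AGE

def test_is_of_age_boundary():
    assert is_of_age(18) is True    # kills the ">= -> >" mutant
    assert is_of_age(17) is False   # and guards the other side of the boundary
```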
Gregory Kapfhammer 00:17:42 Thanks for sharing those concrete examples. Those made a lot of sense. One of the things that I’ve heard is that sometimes a specific type of mutant might motivate a software engineer to write something called a change detector test. Can you tell us what a change detector test is and whether or not that’s a good or bad type of test to write and how that connects to mutation testing?
Goran Petrovic 00:18:05 Sure. A change detector test is a test that will fail most of the time, whenever you change anything in the code. So you’re not really testing the behavior, you’re testing some kind of implementation detail or the specifics. An example might be that you’re catching an exception and asserting the exact contents of that exception. And that’s usually a bad strategy, because you might be changing that message. You should be asserting on the type of the exception: in these kinds of errors, it should return this type of exception, but you shouldn’t care about the contents, or you could assert on the subset of the contents that you’re more or less certain is not going to change. You don’t want to depend on, let’s say, the debugging output of some function that is represented in the exception, because that might change. An underlying object might change its own representation and then you’re going to have to fix your test. Ideally, with a well-written test suite, you could completely re-implement your feature, your code, everything anew, and you just rerun the tests and they all pass. That means that you have mastered the art of testing, but we are not there yet.
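As a hypothetical Python and pytest illustration of the difference, the function and error message below are invented for this example.

```python
import pytest

def withdraw(balance: float, amount: float) -> float:
    if amount > balance:
        # The exact wording here is an implementation detail that may change.
        raise ValueError(f"insufficient funds: balance={balance}, amount={amount}")
    return balance - amount

def test_change_detector():
    # Brittle: asserts the exact message, so any rewording breaks this test
    # even though the behavior is unchanged.
    with pytest.raises(ValueError, match=r"insufficient funds: balance=10\.0, amount=20\.0"):
        withdraw(10.0, 20.0)

def test_robust():
    # Better: asserts only the exception type (and, if needed, a stable fragment).
    with pytest.raises(ValueError):
        withdraw(10.0, 20.0)
```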
Gregory Kapfhammer 00:19:07 Are there examples of certain types of mutants that tend to motivate an engineer to make this type of negative change detector test?
Goran Petrovic 00:19:15 Definitely. So we have tried over the years to suppress them using these heuristics, so we wouldn’t report these. And we always say in our findings, please don’t add a test unless you actually think it’s useful. And then we have the code reviewers, who probably would not let you add the change detector test, because they would say, this is not a useful test, please don’t do that. So there we have a lot of guardrails for this. But the types of mutants that usually produce these are things that, for example, will change the exception message, let’s say, or add a one or a minus one somewhere where it’s not being used arithmetically but is being used for, let’s say, metrics or logging or sleep or whatever. And then people might be tempted to kill that mutant, but that’s not a productive mutant; they shouldn’t do that. But I think with all the guardrails, we’re doing well there.
Gregory Kapfhammer 00:20:06 Okay. That example makes a lot of sense. In a moment we’re going to start talking about the tools that you’re building at Google in order to actually perform mutation testing. However, before we get there, I’m wondering, could you tell us a high-level story about how mutation testing has improved the quality of some software product at Google?
Goran Petrovic 00:20:27 Sure. So those numbers are always hard to get, but many people will send us emails thanking us that a mutant we have found prevented bugs, and we have gathered a lot of examples like this over time. It’s always something like: there is a small bug in the code, and with the mutant people notice this and just add a test to fix it. That’s one thing, a small improvement. Maybe you would never have introduced this kind of bug, so it’s very hard to try to guess the future value of this new test case. But often people would just say this is an equivalent mutant, meaning that the mutant does the same thing as the original code, so you cannot kill it because it’s equivalent to the original code. But it makes them see something in their design and they just say, oh yeah, we need to completely change this. And then they refactor the code and they’re much happier with it. And sometimes it can be something that is not testable at all, and because of this mutant, while trying to kill the mutant, they have to change the code, make it testable, make it better. And then there are secondary improvements, I would say even primary in this case, where this refactoring comes into play.
Gregory Kapfhammer 00:21:36 Okay, that’s interesting. So it sounds like sometimes mutation testing might help you to improve your test suite, but in other occasions it may actually help you to refactor your program and improve its design. Is that the right way to think about it?
Goran Petrovic 00:21:50 Yes. And we have a paper that looks into this, and we have found that this is the case. There are a lot of mutants that are resolved by actually changing the code, not adding tests. Usually both.
Gregory Kapfhammer 00:22:02 Okay, that’s incredible. We’ll make sure to link to that paper in the show notes before we move on. Two quick additional comments. First of all, can you tell us anything about the prevalence of mutation testing at Google today?
Goran Petrovic 00:22:15 I haven’t looked in a while, but I would guess more than 50% of engineers are exposed to mutants during their normal development cycle.
Gregory Kapfhammer 00:22:23 Okay, thank you, that’s really helpful. And then finally, before we move on, I know that sometimes people talk about mutation testing and then they also use an affiliated phrase which is mutation analysis. What are the similarities and differences between mutation testing and mutation analysis?
Goran Petrovic 00:22:40 I was also at first confused by this, because when I came to mutation testing, I came to it directly as an engineer and not as a researcher of any kind. So for me, what I wanted is to take this idea and make it useful to the developers, and that’s where this differentiation lies. Mutation analysis is just this adequacy thing that we talked about before: getting the ratio of killed mutants over all mutants, and then you have this number. Mutation testing is an attempt to capitalize on that technique and get improvements in the code base, mostly in tests, from these developers. And in the end it’s very hard to try to convince developers to look at a hundred thousand mutants in their project when most of them are not actionable. They will just not do it. It’s a reality that we have to face. And code review was the best location for actual mutation testing, where you already have the attention of a developer and you can present some hints how they can improve the code, while all the other tools are doing that as well, like coverage or testing frameworks or linters and other analyzers and so on. This is the best place to try to smuggle in some kind of extra information for them, so they might act on it and then implement new tests.
Gregory Kapfhammer 00:23:58 Okay, thanks for that response. I think it is helpful to know that these mutation testing results are presented to a developer at Google during the code review process. If our listeners want to learn a little bit more about how software engineering at Google works, they’re welcome to check out Software Engineering Radio Episode 609 where we talked with Hyrum Wright all about software engineering practices and processes at Google. Now what I’d like to do is to turn our attention to the implementation of the tools that you’ve created. The overall system seems to be called Mutagenesis, if I’m pronouncing that correctly. And as you mentioned a moment ago, it’s a part of the analysis and code review process. Can you tell us a little bit about some of the specific components or features of this mutagenesis tool that I mentioned a moment ago?
Goran Petrovic 00:24:48 Sure. So the best way to get developers to do something and to act on some results is to introduce the least amount of friction in their workflow. So you should integrate either into the IDE, shifting left if you can, or into the code review tool or the release tool or any other well-established process that they’re used to already. So for mutation testing it made sense to introduce it into code review, because it depends on test results and coverage, and this is where all of that happens anyway, so this was just a natural extension. And the way mutation testing works is that it waits until coverage is calculated for a given pull request when it’s sent for review, and then it calculates a map from each covered line to the set of tests that actually test this line. This is critical, because if you didn’t have coverage information you would just have to run all tests for each mutant, and that would be very, very expensive.
Goran Petrovic 00:25:43 So we just have to wait until coverage is complete and then filter all the tests down to those that cover the line, making it much cheaper. The next step is that this analyzer will take this information and generate mutants, using this mutagenesis process, in all of the eligible lines of code. It’ll take the covered lines of code and filter out, using the heuristics that we talked about before, the mutants that we don’t consider productive. And then the remaining lines will be mutated, only a single mutation each as I said, and then the set of mutants will actually be created and test runs executed against them. And then finally you will have a report: if there are any living mutants in your code, you will get a finding in the code review on that line saying something like, if your code is changed like this, let’s say A plus B to A minus B, then all tests still pass. Please consider adding tests to cover this.
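A simplified sketch of that flow, with assumed names; this is not the real analyzer, just the shape of the steps as described: use the per-line coverage map, mutate only covered and productive lines, pick one mutant per line, and run only the tests known to cover that line.

```python
import random

def mutation_findings(changed_lines, line_to_tests, candidate_mutants, test_kills):
    """
    changed_lines:     line numbers touched by the pull request
    line_to_tests:     {line: set of tests covering it}, from the coverage run
    candidate_mutants: callable(line) -> mutants left after the suppression heuristics
    test_kills:        callable(test, mutant) -> True if the test fails on the mutant
    """
    findings = []
    for line in changed_lines:
        tests = line_to_tests.get(line)
        if not tests:
            continue                            # uncovered line: no point mutating it
        candidates = candidate_mutants(line)
        if not candidates:
            continue                            # everything suppressed as unproductive
        mutant = random.choice(candidates)      # at most one mutant per line
        if not any(test_kills(t, mutant) for t in tests):
            findings.append((line, mutant))     # survived: surface it in code review
    return findings
```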
Gregory Kapfhammer 00:26:39 If I’m understanding correctly, there’s a mutagenesis server for the various programming languages that you supported at Google. Can you give us some examples of programming languages for which you’re able to perform this kind of automated mutation testing?
Goran Petrovic 00:26:54 Sure. What we support is C++, Python, Java, Go, TypeScript, JavaScript, Kotlin, Dart, Common Lisp, SQL, and most recently Rust.
Gregory Kapfhammer 00:27:09 Wow, that’s really cool. So if there’s a mutagenesis server for each language, do you implement that server in the language that it is targeting? Can you explore that process a little bit further and explain how you actually build this mutation testing tool support?
Goran Petrovic 00:27:25 Sure. That’s the most interesting part for me. I’ve always liked compiler theory and playing around with Abstract Syntax Trees, so that’s the most fun for sure. It depends on the language. For C++, we use Clang because it can give us type information. It is slower than a simple parser you might use, but C++ is so complicated that I think Clang is the only way to do it properly. A lightweight system might just parse it using some kind of a parser generator, and that would be fine, but you wouldn’t be able to do all the kinds of suppressions that we do. Detecting pointer types, for example, in comparisons with zero or null pointer cannot be done without a proper compiler front end. For Java, we use an open-source Java parser written in Java; for Go and Python we just use their own internal AST libraries.
Goran Petrovic 00:28:14 Well, for Python we don’t have much type information, but Python doesn’t really have type information anyway, so I think it’s good enough. Lisp is interesting: Lisp is its own parser and compiler, so everything is done in Lisp. Rust is in Rust, using the macro framework, which is quite nice. It’s strong enough that you don’t need to use the actual compiler; you can just use its own internal macro AST. It works quite well. JavaScript is interesting because that’s done in Java, because our compiler for JavaScript is historically in Java, the Closure compiler, so that’s done in Java. For TypeScript, it’s done in TypeScript itself, and it’s quite nice, no problems there. Kotlin and Dart as well are done in Kotlin and Dart, so it’s mostly in their own languages. The exceptions are JavaScript and SQL. SQL is done in C++; I’m not crazy enough to try to implement an AST parser in SQL itself. Maybe it could be done, but that’s done in C++.
Gregory Kapfhammer 00:29:16 Okay, that makes a lot of sense. I remember a moment ago you referred to something called the AST or the Abstract Syntax Tree and that made me think, and in fact I’m wondering when you’re doing this parsing of the C++ code or the Dart code or the Python code, do you parse that to some type of common intermediate representation that all mutagenesis servers understand or do they use the Python AST or the C++ AST that comes from that specific parser? Can you explore that a little bit further?
Goran Petrovic 00:29:51 Definitely. That’s a very good question. I don’t have a good answer about what the best strategy is, but I can tell you what we do: we have a different AST and a different framework for each of these languages. There are some things like Tree-sitter and other parser generator libraries that can parse everything to a degree. They don’t give you type information for all languages; some are supported, I think Go and Java are usually supported, there is a compiler frontend for them even in these generic AST frameworks. But in the beginning I just hacked it, and I liked Clang and LLVM, I liked playing with it. It was nasty, it was very hard to make any progress, but that’s what makes it fun, I suppose. I think now for most of the mutators it takes me a week to implement them: first two days to figure out how to do recursive visits of the tree, and then it takes me two or three days to implement all the mutators, like changing plus to minus, negating things, deleting lines and so on. But yeah, I tried once to migrate everything to this common infrastructure, but I gave up quickly because I couldn’t get this pointer type information for C++, which is used in a few heuristics that are important. Although I would like to finish that project one day and then compare, because I don’t really know what the impact would be of not having it, but I know I gave up there.
Gregory Kapfhammer 00:31:12 Okay. You mentioned a moment ago something called a tree sitter parser and although we won’t investigate that during our discussion today, we’ll put a link in the show notes to the tree sitter parser and what it is and how it works. Many people have probably used that because it’s integrated into their text editor and it helps them with syntax highlighting. But to be clear, you’re not using tree sitter in your current implementation. Is that the right way to understand it?
Goran Petrovic 00:31:36 Yes. In each language we use its own native support for parsing, except for SQL and JavaScript, where we use C++ and Java libraries, respectively, that implement the parser.
Gregory Kapfhammer 00:31:46 So the idea of a mutator is that roughly speaking, it takes as input an abstract syntax tree and it produces as output a mutated abstract syntax tree that has this synthetic bug inside of it at a high level. Is that the right way to think about the tool?
Goran Petrovic 00:32:03 I think so, yeah. Practical implementations are usually simpler. You don’t have to mutate the actual tree, because that’s sometimes hard. What you do is you just find, let’s say, an A plus B somewhere, or if A plus B is greater than five. You first check that it is in a line that’s been covered, then you check that it’s not in an unproductive line, let’s say a log line. And if that works out, then you can just remember: I am in this line, from this byte offset to that byte offset, from this character to this character, I have to change this with that. And you can just throw a die and say, I’m going to change it to A minus B. Then you just remember that the plus is at this offset and you replace it with a minus. You just have to remember the edit when generating mutants; that’s enough to take the original code and just do a simple substring operation on it. That’s much easier than actually mutating the whole tree.
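As a toy illustration of that offset-and-substring approach, here is a hypothetical sketch using Python’s ast module; the real per-language servers are more involved and carry the suppression heuristics described above.

```python
import ast

SOURCE = "def add(a, b):\n    return a + b\n"

class PlusFinder(ast.NodeVisitor):
    """Record where '+' operators sit instead of rewriting the tree."""
    def __init__(self):
        self.edits = []  # (lineno, start_col, end_col) of the text between operands

    def visit_BinOp(self, node):
        if isinstance(node.op, ast.Add) and node.lineno == node.end_lineno:
            # The '+' lives between the left operand's end and the right operand's start.
            self.edits.append((node.lineno, node.left.end_col_offset, node.right.col_offset))
        self.generic_visit(node)

def make_mutant(source: str) -> str:
    finder = PlusFinder()
    finder.visit(ast.parse(source))
    lines = source.splitlines(keepends=True)
    for lineno, start, end in finder.edits[:1]:          # apply just one edit for this toy example
        line = lines[lineno - 1]
        lines[lineno - 1] = line[:start] + " - " + line[end:]   # simple substring surgery
    return "".join(lines)

print(make_mutant(SOURCE))   # prints the source with 'a - b' instead of 'a + b'
```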
Gregory Kapfhammer 00:32:54 Okay, that makes a lot of sense. Now, I know you mentioned a moment ago that you’re focusing on like a specific part of the change in the code. So I’m guessing that means that when someone commits code, you don’t do mutation outside of the region that they’re committing. Is that correct?
Goran Petrovic 00:33:11 It is. And the original design was always like that, because this was very natural for us: everything is done there anyway. So I never really thought about doing anything more. It was only when I started talking to researchers and going to conferences, reading papers, that I realized that I was the outlier. I thought that this was how it’s done. But yeah, for us, you can imagine this yourself: if someone sends you a pull request to review, why would you care about a file that is seven directories away, that you don’t know about, that you are not an owner of, that you have never seen before? If there is a mutant there, it is not, let’s say, useful information in terms of the probability of you actually reacting on it. The developer’s attention is the most precious resource we have.
Goran Petrovic 00:33:56 We don’t want to annoy them, in a sense, because mutation testing, well, very few tools in Google are mandated. For most things, people can use them if they want to, and they can opt out if they want to. If they don’t like your tool, in two days they will just disable it for themselves, and then you lost a user. So we have to be very careful in how we present the information. We have to filter a lot. We even have to limit the number of mutants per, let’s say, hundred changed lines; we can have at most, let’s say, 10 or five, we played with that number historically, because you don’t want to overwhelm the developer. You have to remember this is just a small, tiny analyzer tool in a whole sea of hundreds of analyzers that are running all the time. And the developer just wants to get done with their work, they want to submit their pull request. And unless you are spot-on most of the time, you will lose a user.
Gregory Kapfhammer 00:34:50 You mentioned that the engineer is a precious resource at Google and certainly that makes sense. Your comment though also made me think of the fact that mutation testing is associated with something that’s very expensive. And I remember in one of your papers you mentioned something like there are 500 million tests that are being executed at Google every day and that it’s gatekeeping something like 60,000 code changes and yet somehow mutation testing is still scalable even though you’re running 500 million tests every day. So Goran, that’s astonishing. What is it that allowed you to make mutation testing scalable at Google, even though it’s something that’s associated with being very, very computationally expensive?
Goran Petrovic 00:35:35 That’s a very good question. As I said before, the universe doesn’t like us running all of these tests. If you want to get the mutation score for a whole repository, it’s not possible. But if you filter out most of the mutants, we call it aggressive suppression heuristics, if you mutate only lines of code that have changed in the code review, in the pull request, and that have been covered, and then filter aggressively, and then generate only one mutant per line and not more, and then even have additional limits on the number of mutants per file or per line changed, then you have reduced the number of mutants you have to evaluate by ninety-something percent. And this makes it feasible. And Google has a very cool infrastructure for building and testing and evaluating stuff, so there’s a lot of caching involved. When you compile a piece of code and you just change one line, you’re going to get 99% or higher hit rates on the compiled object cache. So mutants are always compiled in such a way that you compile once, and then each mutant is just a small change on top of the previously compiled blob. So it’s quite efficient in doing that.
Gregory Kapfhammer 00:36:44 So if a software engineer is making a change to the Google code base, if you could give an example, how long does it take to run mutation testing on that code change?
Goran Petrovic 00:36:54 It usually takes a few minutes, but that depends on a few things. One is of course the size of the change. We don’t, for example, run mutation testing on very large changes, especially the LSCs, the Large Scale Changes that are automated. We don’t run it on any automated or autogenerated pull requests, only on human-generated pull requests, and it usually takes a few minutes. But as you remember, we have to wait until tests are done and then coverage is done, which takes whatever it takes, before we even start running mutation testing. But most of the time, I think, it’s done before the pull request is submitted, of course.
Gregory Kapfhammer 00:37:31 Okay, thanks. That’s really helpful. If listeners want to learn more about some of the things that we mentioned, they could check out Episode 59, which is about Static Analysis or check out Episode 324, which connects to Code Coverage Analysis. Goran, before we go on to the ways in which you’ve evaluated mutation testing tools at Google, is there anything else that you’d like to highlight about the implementation of the tooling that you’ve created?
Goran Petrovic 00:37:56 Well, I think the only interesting difference from the open-source implementations of mutation testing and what we have is that we actually generate the mutants based on the ASTs. There are some open-source tools that do that as well, but it’s more prevalent to change the bytecode in Java or the LLVM intermediate representation in C++. And I think the main reason for our choice is that for us, it’s very important to be able to show the developer exactly how the mutant would look. They have to actually look at it, and you have to show it to them. And that’s very hard to do if it’s based on the bytecode. I think those tools are great and I use them, but it’s just sometimes impractical to render the actual diff.
Gregory Kapfhammer 00:38:39 Okay. And I’m glad you brought that up. So you’re saying you actually changed the source code directly. One of the mutation testing techniques that other people have developed to achieve scale is by embedding all of the mutants into something called a meta mutant. Do you use meta mutants in the tooling that you’ve built at Google or did you adopt a different approach?
Goran Petrovic 00:38:59 So I think the reason why we don’t have the meta mutants is that I was lazy and this was easier to do. I think they’re a great idea, because you have to compile everything only once. But as I said before, the compilation overhead is not too big when you have very good object caching, so that wasn’t such a big deal for us: to evaluate the first mutant, it takes, let’s say, a hundred thousand compilation actions, but then for every subsequent one, it only takes a single one. So that makes it cheaper. I also wanted to do the meta mutant implementation, but I just didn’t have the time yet, hopefully in the future.
Gregory Kapfhammer 00:39:35 Okay, that makes a lot of sense. We’ve covered a lot of cool implementation details, but what I want to do now is to dive into a discussion about how you actually evaluate these tools and whether or not you can sense that they’re giving clear value to the engineers at Google. So one of the studies that you ran said that you had almost 15 million mutants and you had collected data about mutation testing at Google over a duration of an entire six-year period. So first of all, I got to say wow, that is super exciting. And then second of all, I’m just wondering, do developers actually write more test cases when they are shown these 15 million mutants?
Goran Petrovic 00:40:17 That’s a very good question. So we also asked that question. Of course our intuition was that they did, because we did. But working with research, you always have to measure things and you have to show that they’re true, which I think is a very, very good strategy, because we’ve also found things that were not true here. Luckily, it turns out that if you present mutants to developers, they will add more tests. And how we measured this was: when you send a pull request for review, there’s a certain amount of code in it and a certain amount of testing in it, right? And you have sent it for review. When you submit it, the submitted pull request will again contain some amount of code and some amount of testing in it. If the approval was given immediately and there’s no feedback, there are no comments, there are no mutants, it’ll probably be the same.
Goran Petrovic 00:41:02 It would just be as is: you sent it for review, no comments, let’s submit, fine. But when you have mutation testing comments saying, please add a test to kill this mutant, please add a test to kill that mutant, we find that the submitted code has more tests than in the control, which is everything else that doesn’t have mutants in it. And furthermore, this in itself is not too exciting. You could have more tests that don’t really contribute to catching bugs, to having high quality, because as I said before, you could just call a function from a test and cover it completely, 100%, but have zero assertions in your test, meaning that you have 100% coverage, you have added more tests, we have detected you adding more tests, but it doesn’t really serve in catching any bugs or preventing outages. So we also looked at the correlations of these: the mutants that are alive when you send a pull request and the mutants that are alive when you actually submit the code, and we have seen these consistently going down for the CLs (a CL is a pull request within Google; it’s a Perforce changelist name). We have noticed that these pull requests that had mutants now have fewer live mutants. So the tests actually killed them, which is great news, because if they just added coverage and did not kill mutants, they would not be as useful.
Gregory Kapfhammer 00:42:23 Okay. So it sounds like what you’re saying is that the developers are not only writing more test cases, but they’re writing better test cases because they can detect more of these synthetically seeded mutants. Is that the right way to understand it?
Goran Petrovic 00:42:37 Exactly.
Gregory Kapfhammer 00:42:38 Okay. Now we mentioned this earlier in our conversation and I want to revisit it now. You talked previously about something that’s called the coupling effect. Goran, can you explain what the coupling effect is and how you studied it in your experiments at Google?
Goran Petrovic 00:42:54 Sure. So as I said before, the mutants that we insert don’t really look like real bugs that humans make. No one is going to just return false from a whole function that has to do some logic. But the coupling hypothesis says that these mutants are similar to bugs in the sense that the test cases that catch these mutants will also likely catch the real bugs. And the coupling measurement, you could say, is that if you had a hundred bugs, you could see how many times there would also be a live mutant that corresponds to each of them. And then of course, if you write a test to kill this mutant, it also should detect that bug, and this is the coupling ratio.
Gregory Kapfhammer 00:43:39 Hmm, that’s really interesting. So have you done any experimental studies that show if you detect this type of mutant, it helps to avoid this type of bug going into production at Google? Do you have a story that you could tell about that?
Goran Petrovic 00:43:52 Sure. We have done this analysis, taking hundreds or even thousands of actual bugs. And this is not easy, because mining bug repositories is very, very hard. If you don’t look at each bug, the distribution is going to be completely wrong, because no project keeps consistent, let’s say, bug hygiene, so it’s very hard. So we had to actually look into each of these bugs and make sure that they are real bugs that happened in code and that we found an associated fix for the bug. What we did there is analyze whether, when this bug was introduced, there would have been a mutant, had mutation testing been enabled for this project. And if so, would the fix also add a test? Would the test added in the fix also kill this mutant? And I think in 69% of cases, the bug and the mutant were coupled. So that means that if these tests had been written, had mutation testing been enabled, these bugs would have been prevented.
Gregory Kapfhammer 00:44:52 So that sounds like it’s an easy way to make a clear win argument for the introduction of mutation testing.
Goran Petrovic 00:44:59 Sure. But it was very expensive to calculate.
Gregory Kapfhammer 00:45:01 Okay, that makes a lot of sense. I’m wondering if we could take a quick step back. There are many different ways that we could improve the overall quality of our software. And you’ve mentioned test coverage analysis and we’re talking about mutation testing now. I’m wondering, can you give us your intuition? How did you know that it was the right time to introduce mutation testing at Google given the analysis pipeline that you already had in place?
Goran Petrovic 00:45:28 Well, that’s a funny story. I had no idea. I joined the team in Google Shopping, and as it used to be the case, everyone had a 20% project. And my manager said, oh, we started this mutation testing last hackathon, maybe you can inherit it. And I’m like, oh, I have no idea what this is, I’ve never heard of this, but I will inherit it because I don’t really have a choice, right? So I started reading those papers and eventually rewrote all of their code, because it was a hackathon project, one week of just trying to get a proof of concept working. And then it became my project over time. So it was their idea, and I don’t know where it came from because it was before I joined the team, but I just pushed and pushed for years to get it done. And I mean, most things you can get done; the question is how much you have to change the original idea for it to be feasible. I call this capacity-driven development: if you cannot run all the mutants, then choose a subset that makes sense. If you cannot evaluate all lines, well, drop some. You have to eventually come to a workable thing that still brings value. You don’t have to implement whatever was envisioned 50 years ago on paper, right? So I think for many projects like this, good enough is good enough.
Gregory Kapfhammer 00:46:40 Those are several very interesting examples. So again, focusing on a specific line or maybe even randomly selecting a specific mutant, are there any other experimental results that you have discovered that have ultimately helped you to enhance the mutation testing tools at Google?
Goran Petrovic 00:46:57 There was one idea we had, but we finally didn’t do it because we didn’t think it was ethical, but it was just a random analyzer that would pick a random line in the code and say, oh, maybe you should look at this line again and think about it. In the end we didn’t do it, but it was a fun experiment. But in mutation testing, yeah, there are many improvements that we’ve done over the years for the framework. Some of them are just these heuristics; I added one today, and I still get them after, I think, nine years. People will still say, this is not a useful mutant, and then I will go and look and try to figure out how I can suppress those kinds of mutants. You cannot imagine how many logging frameworks we have, how many wrappers of logging frameworks we have, and all the other things like caching mechanisms. Because if you look at it, a caching mechanism, if you remove it, doesn’t change the functionality. It just makes your thing go slower if it’s uncached. So any mutant around caching will be an equivalent mutant. And learning about this from our users and just implementing the support for that is the way we have improved the tool the most, I think.
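A tiny hypothetical Python example of such an equivalent mutant around caching: deleting the cache changes performance, not behavior, so no functional test can kill it.

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # mutant: delete this decorator; every result stays identical
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(20) == 6765     # passes with or without the cache, so the mutant is equivalent
```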
Gregory Kapfhammer 00:48:00 I think this sounds like an awesome experience, and you’ve been mentioning these various heuristics from a back of the envelope perspective, are we talking about 20 heuristics or hundreds of heuristics? What kinds of heuristics are built in? And roughly speaking, how many are there?
Goran Petrovic 00:48:16 So considering there are 11 languages, there are thousands of heuristics, but many of them fall into the same group. Let’s say, don’t mutate calls to these kinds of functions. So logging functions, things like, let’s say collection capacity. Let’s say you construct a vector, you can give it an initial capacity, but you can reduce it by one and nothing’s going to happen because the vector is going to grow. There is no change. The same goes with the Java collections. If you just increase the capacity of a collection, it’ll just grow or shrink as it needs to. It won’t change the behavior. So all of these are, you could say they are 50 different containers with five different initialization methods. I would call them a single heuristic, but they’re like 50 instances of each of this. So in in those terms, I would say they’re like five or six groups of heuristics and then thousands of instances of them. And we have a TSE extension that has an appendix that is like 10 pages long, listing examples from these.
Gregory Kapfhammer 00:49:13 Okay, we’ll make sure to reference that in the show notes so other people can pick it up. I have to say this has been a fascinating and thought provoking conversation. What I’d like to do now is to turn our attention to some of the broader lessons that you’ve learned from introducing mutation testing to the engineers at Google. The first thing that I’ve heard you mention more than once during our discussion today is that sometimes you’re taking ideas from research papers on mutation testing, and then you’re trying to introduce them into practice and studies at Google. What have you learned from that process? Can you tell us a little bit more?
Goran Petrovic 00:49:48 Definitely. That’s been a very interesting journey. So what I’ve learned is that, as I said before, developer attention, not only at Google, not only developer attention, just human attention, is the only resource in the universe that’s worth thinking about. So wasting it is always a bad idea, I think. Just thinking about how you can improve the product and the development process is the best way to start. Sometimes if you have an idea, you read the paper, you just want to present it as is and build it in, and you resist any deviation from it. That’s, I think, a bad approach. You just think how we can help our product or our teams, how can they be more efficient, how can they have fewer bugs in production, and then work backwards from there. And then I think that way things fall into place.
Gregory Kapfhammer 00:50:37 You’ve invested significant effort and time in building these tools, and I’m assuming you’re using the tools on a regular basis as well. Can you tell us what have you learned and how are you a better software engineer after you’ve built and started to use these mutation testing tools?
Goran Petrovic 00:50:53 They’re annoying as hell. When I send a pull request for review, I know I’m going to get a mutant, and I know exactly where, because I’ve implemented it, and I already start writing an extra test for it because it’s so annoying. And then sure enough, it’s there and it’s in that line. The most annoying thing is, even when I fix that, there’s another one that I didn’t see coming. So I think people learn to anticipate them, and immediately when they write code, they see, oh, this test only covers this left branch, I need to add something for the other branch as well. I think I have certainly started writing more and more tests to cover more cases. And we have measured this as well; over time there’s this thing that we called exposure. As you have been using mutation testing for a longer time, you tend to write more tests proactively. Even in the beginning, when you send your initial pull request, you tend to have fewer live mutants over time and more added tests. And I’ve definitely noticed this in myself.
Gregory Kapfhammer 00:51:51 That’s a fascinating result. As we draw our conversation to a conclusion, I want to pick up on several things that you’ve said and ask you to give advice to two different groups of individuals. So first of all, do you have advice for researchers who are building mutation testing tools or studying their effectiveness? And then additionally, there are many listeners who want to know how they can get started with mutation testing. So can you give some advice to both of those groups?
Goran Petrovic 00:52:19 Sure. So I have to admit my bias of a large-company viewpoint. The way to start using mutation testing is very simple: just Google mutation testing, you will find for your language a tool that is good, and you can try and run it. My main advice would be to integrate it immediately into the workflow, whatever it is, GitHub workflows or whatever you have, Jenkins, Hudson, no matter what, because asking people to do an extra thing a day is probably not a good idea, no matter if it’s at work or at home. And of course, suppress mutants as much as you can; find a tool where you can just ignore most of them, because otherwise no one is going to look at thousands of them. That would be my advice.
Gregory Kapfhammer 00:53:02 Okay. That will be helpful. And we’ll in fact link to some of the GitHub repos for various tools like PIT for the Java programming language, or mutmut for Python, or Stryker for JavaScript. I think those are all helpful ways for people to get started. Are there any other things that you would like to share as advice to a software engineer who wants to learn mutation testing and maybe try it out for the first time?
Goran Petrovic 00:53:27 Definitely. I think also, as you asked before, for researchers, what we would all like is to have, I would say, a user-centric view of mutation testing or analysis. Just think in terms of something that someone in some company can pick up as a paper, read it, and then sit down and implement it, and then get some benefit from it. I find those papers the most interesting. And for people starting out, I think patience, and also keeping in mind that the end goal is not to kill mutants, it’s not to write tests, it’s to make better products.
Gregory Kapfhammer 00:54:00 Wow. That’s a great point to conclude on. In our conversation today, we’ve talked about mutation testing, and you’ve hinted at the key idea, which is that ultimately mutation testing helps us to achieve higher quality software. I know that there may be something we’ve overlooked. Are there any additional topics that we should briefly discuss before drawing our discussion to a conclusion?
Goran Petrovic 00:54:21 The only one that comes to mind would take us, I think, 10 years to conclude, so we can skip it, and that is the question of quality. We don’t know what quality is. We cannot measure it.
Gregory Kapfhammer 00:54:31 Okay. Now, for those of you who are interested in investigating the idea of quality, you might want to read the book Zen and the Art of Motorcycle Maintenance, but we’ll leave that discussion for a later time. Goran, thank you so much for being on the show. It was a great pleasure to chat with you today. Is there any final call to action that you would like to share with our listeners?
Goran Petrovic 00:54:52 No, I think that’s it. Thank you, Greg.
Gregory Kapfhammer 00:54:54 Thank you, Goran. For those of you who are listening and interested in learning more, I hope you’ll check the show notes so that you can explore mutation testing and learn how to use Goran’s insights and tools inside of your own software development adventure. Thanks, Goran. Talk with you again soon.
Goran Petrovic 00:55:11 Ciao.
[End of Audio]