Gregory Kapfhammer, associate professor at Allegheny College, discusses the common problem of ‘flaky tests’ with SE Radio’s Nikhil Krishna. Flaky tests are test cases that unreliably pass or fail even when no changes are made to the source code under test or to the test suite itself, which means that developers can’t tell whether the failures indicate bugs that need to be resolved. Flaky tests can hinder continuous integration and continuous development by undermining trust in the CI/CD environment. This episode examines sources of flaky tests, including physical factors such as CPU or memory changes, as well as program-related factors such as performance issues. Gregory also describes some common areas that are prone to flaky tests and ways to detect them. They discuss tooling to detect and automatically mark flaky tests, as well as how to tackle these issues to make tests more reliable and even ways to write code so that it’s less susceptible to flaky tests.
- “A Survey of Flaky Tests,” Transactions on Software Engineering and Methodology, 31:1, 2022, https://www.gregorykapfhammer.com/research/papers/Parry2022/
- “Evaluating Features for Machine Learning Detection of Order- and Non-order-dependent Flaky Tests,” Proc. the 15th Int’l Conference on Software Testing, Verification and Validation, 2022, https://www.gregorykapfhammer.com/research/papers/Parry2022a/
- “Surveying the Developer Experience of Flaky Tests,” Proc. the 44th Int’l Conference on Software Engineering – Software Engineering in Practice Track, 2022, https://www.gregorykapfhammer.com/research/papers/Parry2022b/
- “What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection?” Proc. the 3rd Int’l Conference on Automation of Software Test, 2022, https://www.gregorykapfhammer.com/research/papers/Parry2022c/
Related SE Radio Episodes
- 474 – Paul Butcher on Fuzz Testing
- 461 – Michael Ashburne and Maxwell Huffman on Quality Assurance
- 322 – Bill Venners on Property-Based Tests
- 283 – Alexander Tarlinder on Developer Testing
- 256 – Jay Fields on Working Effectively with Unit Tests
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Nikhil 00:00:16 Hello and welcome to Software Engineering Radio. I’m your host, Nikhil, and in today’s Software Engineering Radio episode, we welcome Gregory Kapfhammer, an associate professor at Allegheny College with a PhD from the University of Pittsburgh. His research focuses on software engineering and testing, particularly on flaky tests. Gregory is involved in various academic roles, including associate editor, program committee member, and reviewer. He also contributes to open-source software testing and analysis tools on GitHub. Our discussion will center around the common problem of flaky tests encountered in software development. So welcome to the show, Gregory, and was there anything in the bio that you might want to add that I missed?
Greg Kapfhammer2 00:01:05 First of all, let me say thank you for inviting me to chat today about flaky tests. I’m excited to be a guest on Software Engineering Radio, and I think everything that you shared was wonderful. Let’s dive into the conversation.
Nikhil 00:01:17 Great, let’s start with the basics, right? So what is a flaky test and why as a software developer should I care about flaky tests?
Greg Kapfhammer2 00:01:26 So a flaky test is a test case that passes or fails even when you don’t change the source code of the program under test or the test suite itself. This is a real challenge for software developers because the quality of the test case signal is diminished. This means that a test might pass sometimes and then fail sometimes, and the developer won’t be able to know whether the failures actually indicate that there’s a bug in the program that they need to resolve.
Nikhil 00:01:59 That sounds kind of annoying, obviously. That’s not great, right? So what exactly does it hinder? Obviously, it’s not great to have a test that sometimes passes and sometimes fails, but are there any other process-driven things that might get impacted?
Greg Kapfhammer2 00:02:17 Yeah, that’s a really good point. So when you have a flaky test in your test suite, it often hinders continuous integration and continuous development. So you might have a build that fails in continuous integration and you wonder why it fails, and you also wonder why it sometimes passes and sometimes fails. So if that’s due to a flaky test, then it causes you as a developer to stop trusting the CI and CD environment and therefore limits your ability to add new features or to introduce bug fixes.
Nikhil 00:02:52 Yeah, that sounds like something that I would do as well. If I had a CI/CD environment that’s flaky and there’s something to be delivered, I’d be like, oh yeah, it fails half the time. I’m going to just throw the dice, as they say. Maybe we can talk about a real-world example that you might have encountered where this is common.
Greg Kapfhammer2 00:03:12 Yeah, that’s a really good question. So I myself have implemented a number of software testing or automated assessment tools that I have released to GitHub. One of the tools that I have built is an automated assessment tool that I use when I am checking my students’ work. There were some order-dependent test cases within that test suite, and so if I ran the tests in a different order than the standard order, the test cases would sometimes fail in a flaky fashion. This made it really difficult for me as a developer to know whether or not the new feature that I had added was correct, and the test case was failing in a flaky fashion, or there was actually a real bug inside of my program.
Nikhil 00:03:59 Right. I’m sure you must have given your student an earful about that. So, okay, you talked about order-dependent and non-order-dependent. So is that a way to categorize flaky tests, as order-dependent or not? And maybe you could give us some examples of what a non-order-dependent flaky test is, if you can.
Greg Kapfhammer2 00:04:19 Yeah, that’s a great point. So first, let me say that an order-dependent flaky test case is a test case that tends to pass and fail in a flaky fashion when you run it in a different order than which it was originally intended to be run. This often happens in a regression testing environment when you only want to run a subset of test cases on your developer laptop or in CI. Another kind of flaky test case is called a non-order-dependent flaky test. These are test cases that pass and fail in a flaky fashion for reasons not connected to order. So they could be related to timing issues, they could be related to asynchronous issues or networking issues or maybe issues that are related to the CPU or the memory or the file system. Those non-order-dependent flaky tests are also critical to limiting our confidence in the correctness of the program and the test suite.
Nikhil 00:05:26 Right. It sounds like the non-order-dependent ones would also be harder to debug, right? Because with the order-dependent one, it’s kind of easy to say, okay, fine, if you change the order, it fails, and if you keep a particular order, it’s the same. So, you talked about CPU and memory and all that, and that kind of led me to this particular thought. What are the common sources of flaky tests that are related to the external environment? For example, what are some of the other common sources that we have?
Greg Kapfhammer2 00:05:53 Yeah, your mention of the CPU and the memory is a really good one. So in my experience, both in my own test suites and in the real-world ones that we’ve examined in GitHub projects, we have found that if the CPU changes significantly (for example, it goes to being a slower CPU model) or the memory changes significantly (for example, much less memory is available), test suites tend to become more flaky in nature. That’s because you often have certain resource constraints in mind as an assumption when you’re writing the test case as a developer. And then those assumptions can be violated when you’re running it in a different execution environment that has, for example, a different class of CPU or a different amount of memory.
Nikhil 00:06:49 Great. So, those are obviously some physical sources. Do you think there may be actual problems in your program that might lead to a flaky test?
Greg Kapfhammer2 00:06:59 That’s a really good point. In my experience, it has both been a problem in my test suite and also a problem in my program. For example, there’ve been cases when I was building applications that had a significant web user interface and for those systems, sometimes my test cases were flaky because parts of the web user interface didn’t appear fast enough, or according to some timeline that was expected by the test case. In that situation, it was really a problem in my program that manifested itself as a flaky test.
Nikhil 00:07:39 Right. So it’s kind of like a performance issue in the way you render your web user interface, for example, that might be causing this. Okay. What about other things? Do you think there could be a time-related factor in this? Because I’ve noticed that, at least when I do my tests, I always get nervous when it comes to time because you have different time zones, you can run servers in different places, et cetera. Right? So what are your experiences with time? Do you think that is something that we should be careful about?
Greg Kapfhammer2 00:08:15 So as a developer, modeling and correctly handling the date and the time is, in my experience, one of the most challenging things to do correctly. So there may be situations in which the test cases pass if they’re run in a certain time zone or if they’re run at a certain time of day, but if assumptions about the test cases’ date or time or the program’s date or time are violated, then the test case may start to fail in a flaky fashion. So I would say it’s not only issues about timing, that is how long it takes for things to appear in the UI or to run in an asynchronous fashion, but it’s also about the date and the time modeled from the perspective of both the program and the test suite.
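Editor’s sketch (not from the episode): one way to keep a date-sensitive test from depending on where or when it runs is to pass in a fixed instant and explicit time zones rather than reading the machine’s clock or locale. The function `departure_date` and the chosen dates and offsets are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta, timezone


def departure_date(departure_utc: datetime, local_tz: timezone) -> str:
    """Return the calendar date of a departure as seen in a given time zone."""
    return departure_utc.astimezone(local_tz).date().isoformat()


def test_departure_date_uses_explicit_zones():
    # A fixed instant and fixed offsets: the result does not depend on the
    # clock or the time zone of the machine that happens to run the test.
    departure = datetime(2024, 3, 10, 1, 30, tzinfo=timezone.utc)
    india = timezone(timedelta(hours=5, minutes=30))
    us_east = timezone(timedelta(hours=-5))
    assert departure_date(departure, india) == "2024-03-10"
    # The same instant falls on the previous calendar day further west.
    assert departure_date(departure, us_east) == "2024-03-09"
```

A test that instead called `datetime.now()` and compared against a hard-coded date could pass on one machine and fail on another, which is the kind of assumption violation described above.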
Nikhil 00:09:04 Right? Right. Just to share a real-world use case from my work, I work in the logistics space. So for us, time zone is a major deal, right? Because you’re moving cargo across the world, and there has been more than one occasion where literally a test would be running in my time zone and then we deployed to the cluster and the cluster happens to be in America, and suddenly all the tests failed because the flight that the test had assumed would start on a particular date has now suddenly shifted to the previous day, right? So this has been my experience, at least with flaky tests. So obviously there are these common sources. So how can we detect this? Is there a way that, before we send it to the CI/CD and before we actually commit it to the Git repositories, we can figure out that, okay, these tests might be flaky?
Greg Kapfhammer2 00:10:05 Yeah, I think that there are a number of things that developers can do to detect that a test case is flaky. One of the things that I normally first do is rerun a test case some fixed number of times. So if the test case tends to pass or fail in a non-deterministic fashion, then I often will first run that test case in isolation a repeated number of times. If it passes in isolation a repeated number of times, then I might start to gradually increase how many test cases I run and the context in which I run those test cases and then continue to rerun it a fixed number of times. To give a concrete example, let’s say that I have two test cases that are interacting inadvertently through shared state in a database or in a file system. What I might then be able to discover is that when I rerun those test cases together as a pair or as a group, they negatively influence each other and cause flakiness. In that case, to solve the problem, I would try to introduce better setup and tear down methods for those test cases, and then help to ensure that they’re no longer flaky when I run them both in isolation and additionally in a group of other tests.
Nikhil 00:11:36 Right. I mean, that’s, that I guess is one way of doing it. It also speaks to, you should probably make your test fast, because if you want to run it multiple times like this, then definitely that’s, that’s something that you should consider. Are there any tools that we can use kind like a linting tool or some kind of statistical analysis that we can do maybe to figure out if a test might be flaky?
Greg Kapfhammer2 00:12:00 Yeah, that’s a really good question. The first thing I should say before I get to your comment about statistical analysis is that often I use tools that help me to do an automated rerunning as soon as the test case is detected as flaky. So that way I don’t have to do any manual rerunning on my own. I just allow my testing framework to re-run the test cases for me. I also use tools both for Java programming and for Python programming that can attach markers to my test cases. I could have a flaky marker that automatically appears on a test. If rerunning indicates that it’s likely to be flaky, then I can change my rerunning process so that at least for a period of time it screens out those test cases that are the flaky ones, running the rest of the test suite, and then allowing me to focus back on maybe finding and fixing the sources of flakiness in the test cases that were automatically marked as being flaky.
Nikhil 00:13:09 Oh, okay. Cool. So these would be frameworks that you run on top of the actual unit testing framework that you have, right?
Greg Kapfhammer2 00:13:16 Yeah, that’s exactly right.
Nikhil 00:13:18 Okay. So now we’ve detected a flaky test, then what, what should we do now? How do we actually tackle this particular test? What is the approach that we do?
Greg Kapfhammer2 00:13:29 One of the things that I do at this point is investigate how the test case is interacting with the execution environment. So for example, I might introduce instrumentation into either my program or my test suite. That instrumentation can do things like track how the test interacts with the file system or track how the program interacts with the file system. I can also track other things like code coverage, memory usage, and execution time. One of the things that I found is that if a test case takes up too much memory or repeatedly accesses the file system in a way I didn’t predict, that might be an indicator as to why it is flaky.
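Editor’s sketch (not from the episode): a lightweight form of this instrumentation can be built from the Python standard library, using `time.perf_counter` for execution time and `tracemalloc` for peak memory. The helper `measure` and the sample test are hypothetical; file-system tracking and code coverage would need additional tooling.

```python
import time
import tracemalloc


def measure(test):
    """Run a test once and return (elapsed_seconds, peak_bytes_allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        test()
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


def test_builds_large_list():
    data = list(range(100_000))
    assert len(data) == 100_000
```

Recording these numbers over many runs, and comparing runs that passed with runs that failed, is one way to spot a test whose memory use or duration is close to a limit it silently assumes.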
Nikhil 00:14:17 Right. When you say instrumentation, just to clarify, you’re talking about things like logging or adding metrics and measurements about your code, or do you have a specific instrumentation tooling in mind?
Greg Kapfhammer2 00:14:32 That’s a really good point. So when it comes to instrumentation, you can as a developer introduce that into your system on your own, or alternatively, you can use a wide variety of open-source and/or commercial tools that can monitor the execution of your program and your test suite. As a concrete example of that, there is a product called Datadog, and Datadog can monitor both the behavior of your program and the test suite and then give you hints that will help you to discover why your test case may be flaky.
Nikhil 00:15:10 Ah, okay. That is cool. Okay, so we have looked at it and we’ve kind of instrumented it. Should we even repair it? Can we just delete this test? I mean, how do we actually make the call whether you should spend your time, because after all, it is a test, you’re not actually delivering features here, right? You’re not delivering business value. So one argument could be made is that, hey, you can’t spend that much time on the test, just delete the test. So how do you actually make that kind of determination, let’s say about this?
Greg Kapfhammer2 00:15:42 Yeah, that’s a good point. So there are two things that I would say in response. First, one of the things that I often do is quarantine the test case and stop running it in continuous integration. As I mentioned previously, another thing that I’ll sometimes do is mark the test case as a flaky test, and then I will skip it for a period of time. However, if I quarantine a test case or skip a test case, I may be missing out on the fault detecting potential that is associated with that test. Another point that I would briefly add is that sometimes there’s a silver lining to flaky test cases. Sometimes test cases that are flaky are actually pointing out a problem in the business logic of my program, and so it might be helpful to me in certain circumstances to at least invest some time to root causing the flaky test because it could lead me to actually finding a bug in my program.
Nikhil 00:16:44 Oh, awesome. So that’s great. I mean, this is another example why I love testing, because even a bad test can give you more information than no testing at all. Right, that’s great. So, okay, I’ve decided to track down this particular flaky test and try to see if I can repair it. How would I go about doing that? Should I, like you said, quarantine the test well, but that’s basically, essentially skipping it. Do you think it’ll be better to maybe look at the setup and tear down functions or try to see if we can make sure that this test is well factored or isolated better than currently?
Greg Kapfhammer2 00:17:21 What you’re suggesting is something that I definitely try to do. Especially if I’m dealing with an order-dependent flaky test case, I will often try to make sure that I have the best possible setup and tear down methods associated with all my tests, especially those test cases that are flaky. Obviously there’s a tradeoff here when it comes to the flakiness of my test suite and the performance of the overall testing process. So one of the tradeoffs that I have personally experienced is that if I clear out a significant amount of state from a database or a file system or things that are shared in memory, that could cause my test suite to run more slowly than it would otherwise because I’m executing a lot of heavyweight code in either setup or tear down. So ultimately, I want to try to find the right balance between clearing out enough of the shared state so that I don’t have flakiness, but maybe allowing some shared state as long as it doesn’t negatively influence the outcomes of test cases.
Nikhil 00:18:34 Okay. So what happens? What would you do if you actually went down this particular route and, at the end of the day you found that, okay, this is something that’s maybe a little bit beyond your control, right? So it might be that, okay, there is a network problem between your machine or your laptop and the server on which the test is trying to connect to a third-party service that you can’t mock out, and basically that is kind of causing this. Do you think there is actual value in still retaining the test but not actually running it that often? More kind of like from a, maybe don’t call, don’t use it as a unit test, but use it as an integration test or a smoke, a smoke test or something that you run less often, but just to make sure that things are working? Do you think there’s value in that?
Greg Kapfhammer2 00:19:28 Wow, that’s a great point. Before I directly answer your question, let me pick up on something you said a moment ago. One of the things that I tried to do, which you talked about, is the idea of introducing various kinds of mock objects or mock services into my test suite. The idea there is that instead of interacting with some unpredictable third-party service, I’ll actually have a mock of that third party service. And in my experience, that often helps me to control flakiness, of course, at the expense of perhaps limiting the realism of the test cases that I’m running.
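Editor’s sketch (not from the episode): replacing an unpredictable third-party service with a mock can be done with `unittest.mock` from the Python standard library. The function `shipping_quote`, the client’s `get_rate` method, and the rate value are all hypothetical.

```python
from unittest.mock import Mock


def shipping_quote(client, weight_kg):
    """Price a shipment using a rate fetched from a (possibly remote) client."""
    rate = client.get_rate("air")
    return round(rate * weight_kg, 2)


def test_shipping_quote_with_mocked_service():
    # The mock answers instantly and identically on every run, so network
    # latency or outages in the real service cannot make this test flaky.
    client = Mock()
    client.get_rate.return_value = 2.5
    assert shipping_quote(client, 10) == 25.0
    client.get_rate.assert_called_once_with("air")
```

The cost is the one mentioned above: the test no longer exercises the real service, so it says nothing about whether the integration itself still works.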
Nikhil 00:20:07 Right? And in some cases that might even be the point of the test, right? You might have written the test to exercise that part of the functionality. So if you’re driven up the wall in that situation, then what do you recommend that we do?
Greg Kapfhammer2 00:20:23 Yeah, so now we come to the question of if we can’t have mocks, but we recognize there’s flakiness perhaps because of our dependence on some third party API, then in that situation, I think we should do precisely what you suggested a moment ago, which is to make those test cases run infrequently. Like maybe they’re part of an integration test suite, or maybe they’re part of a smoke test suite, but the idea is maybe we don’t run them frequently inside of CI, but we run them as a separate process perhaps right before we make a major release or when we’re introducing a big feature to our system.
Nikhil 00:21:06 Great point. And I also actually want to circle back a little bit on a point that you made earlier regarding a certain class of tests, which are the UI tests, let’s say, the web UI or the web tests, and you mentioned that there tend to be more cases of flaky tests amongst them. So can you maybe talk a little about why that is and if you have any recommendations to optimize that as much as possible to reduce that flakiness?
Greg Kapfhammer2 00:21:36 So in my experience, when I’m building test cases for web user interfaces or graphical user interfaces, sometimes my test cases are flaky because there are timing constraints in the user interface, and my test case may make an assumption about how quickly something appears in the web UI or in the graphical user interface. So in those situations, I try to make my test cases more event driven in nature as opposed to making them timing or wait based in nature. Another thing that I have found super helpful, especially when I’m dealing with web user interfaces or mobile apps, is avoiding hard coding my test cases for the user interface to individual pixels or individual locations. I want to make them more about existing widgets in the web user interface or in the Android app, and less about their precise location on the screen, because that often introduces flakiness, which I think can be avoided if you adopt one of the strategies I mentioned.
Nikhil 00:22:53 Right. So, just to make sure that I understood, what you’re recommending is that, A, don’t introduce timing-based tests. Basically, try to listen for events that have happened and then write your test around, okay, this event happened, therefore this test has passed or not, right? And the other one was regarding making it a little less brittle by looking for the existence of certain components or features in the UI rather than looking at it from a pixel-based or a point-based placement in the layout, so to speak. Excellent suggestions. When you look at the event wait, one other thing is that the event might not happen, right? That might be because your code is wrong. So your test is waiting for an event, but the event might not appear because you wrote something wrong or your business logic broke, right? So how do you actually deal with that? Does your test wait for that event for a certain amount of time and then give up, or how do you actually write that test?
Greg Kapfhammer2 00:24:07 Yeah, that’s a good point and it makes me think of two things in particular. So first of all, we know that we can’t wait for an indefinite period of time. So in that situation, we still ultimately do have to introduce some type of timeout, but we try to make the timeout big enough so that the event will always occur within a reasonable period of time. The other thing in response to your point that I have found helpful for graphical user interfaces, web interfaces, and code-based testing is to think carefully about the oracles or the assertions that are inside my tests. If my oracle is looking at too much of the state, whether that’s the UI or the database or the file system, it may be waiting for events to happen or content to appear that isn’t really critical for the test case’s assertions to be able to pass. So another strategy that I’ve adopted is thinking very carefully about the oracles or the assertions and ensuring that they’re narrow enough so that they don’t become flaky.
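Editor’s sketch (not from the episode): the difference between waiting on an event with a generous timeout and sleeping for a fixed interval can be shown with `threading.Event`. The helper `start_render` simulates a widget that appears after a delay; it and the chosen delays are hypothetical stand-ins for a real UI.

```python
import threading


def start_render(ready: threading.Event, delay: float) -> threading.Timer:
    """Simulate a UI widget that becomes available after `delay` seconds."""
    timer = threading.Timer(delay, ready.set)
    timer.start()
    return timer


def test_widget_appears():
    ready = threading.Event()
    timer = start_render(ready, delay=0.05)
    # Wait for the event itself, bounded by a generous timeout, instead of
    # sleeping for a fixed interval and hoping the widget is there by then.
    appeared = ready.wait(timeout=5.0)
    timer.cancel()
    assert appeared, "widget did not appear within the timeout"
```

`Event.wait` returns as soon as the event is set, so the large timeout only costs time when something is genuinely broken, which is also the case where the test should fail.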
Nikhil 00:25:21 Right. Just to kind of enlighten myself, what do you mean when you say oracle? Can you kind of explain that term to me?
Greg Kapfhammer2 00:25:28 So the oracle is the part of the test case that helps us to know whether the test case did or did not pass. So if we’re taking a simple example, the idea of the oracle is to accept the output that comes back from the function under test and then to inspect all or some portion of that output. If what you’re looking for appears in the output, then the oracle or we could say the assertion will allow the test case to pass, but if the oracle looks in the output and it says, hey, I can’t find what I’m looking for, then in that situation, exactly, the test case would fail.
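Editor’s sketch (not from the episode): a narrow oracle inspects only the part of the output that the test is about. In this hypothetical example, `create_order` returns a record that includes a timestamp, and the assertions deliberately ignore it.

```python
import time


def create_order(item):
    """Return a new order record; created_at differs on every call."""
    return {"item": item, "status": "created", "created_at": time.time()}


def test_order_is_created_with_narrow_oracle():
    order = create_order("widget")
    # Narrow oracle: inspect only the portion of the output under test.
    assert order["item"] == "widget"
    assert order["status"] == "created"
    # A broad oracle that compared the whole dictionary, including
    # created_at, would depend on the clock and could never pass reliably.
```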
Nikhil 00:26:10 Okay. Okay, cool. So I’m used to the term assertions for that. So this is, today I learned. Thanks for enlightening me. So moving on, obviously you had mentioned there were tools that we could use to detect flaky tests and maybe even handle aspects of flaky tests. Can you recommend a few tools in Java and Python, because those were the languages that you had mentioned you did your research on, that you can use for flaky tests?
Greg Kapfhammer2 00:26:44 So the first thing that I would say is that there are many tools for both Python and Java developers that allow you to do various types of intelligent rerunning of your test suite. Normally these are plugins to a specific test automation framework or some type of build system. For example, the Maven build system has the Surefire plugin that will do rerunning of your test suite when a flaky test case is detected. There are similar plugins that you can use if you’re a Python developer and you are using pytest as your testing framework. For both of those tools, I would say that rerunning is one of the easiest and nicest places to start when it comes to leveraging tools that will help you to detect and/or fix flaky tests. The other area is a bigger area that we may want to investigate in greater detail, and that is the use of tools that use machine learning in order to help you to find and fix flaky test cases.
Nikhil 00:27:52 Right. That, that sounds fascinating. And machine learning is quite the fashion right now. Can you elaborate a little bit more on how that can be done?
Greg Kapfhammer2 00:28:00 So when it comes to machine learning, the first thing that we should talk about is whether we’re using a supervised or unsupervised or reinforcement machine learning technique. My specific examples and the tools that we’ve built and released on GitHub all use what are called supervised machine learning algorithms. The idea there is that you’re going to collect information about the program and the test suite, and then you’re going to connect that with information about whether the test case is or is not flaky, and then use a trained machine learning model to make predictions about the flakiness of test cases in your system.
Nikhil 00:28:47 Okay, cool. But I mean and this is just me being a complete novice in this, I’m not that familiar with machine learning techniques, but my understanding is that machine learning for the large part is a probabilistic method, right? So if you apply a machine learning technique, they can at the most tell you probabilistically that this is probably the solution that you’re looking for. How do you actually deal with that? Is that the method that you used, which is a supervised machine learning method? Do you get better results from that and you know, are you more confident about your results there?
Greg Kapfhammer2 00:29:25 That’s a good point. So let me address it in part and then I’ll turn it back over to you for further clarification and question. The first thing that I would say is that you have to have some information about both the program and the test suite, and the better information you have, the better the machine learning algorithm is going to be able to make predictions. The features that you’re going to collect are what I would call static features and dynamic features. The static features are those that you can extract by looking at the source code of the program and the test suite. The dynamic features are those that you can extract by actually running and observing the program and the test suite. And again, my first main point is the better features you have, the better the machine learning algorithm is going to be able to do at predicting the individual test cases and whether or not they’re going to be flaky.
Nikhil 00:30:32 Right. So when you say features, could you give us some examples of this? Is it like the number of commits? Is it the users? Is it any aspects of the code? Is it metadata about the code? What kind of features are you talking about?
Greg Kapfhammer2 00:30:48 So when it comes to a static feature, some of the things that we might extract are related to what’s known as the abstract syntax tree or the AST of either the program or the test suite. The idea might be that the more complicated or deep the abstract syntax tree is, then maybe the more likely it is that it will ultimately manifest in flakiness. That would be, again, an example of what we would call a static feature. A concrete example of a dynamic feature might be things like the memory overhead of a test case, the utilization of the file system on behalf of the test, or maybe additional information about the code coverage associated with running the test suite on a system under test.
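Editor’s sketch (not from the episode): static features of this kind can be extracted with Python’s built-in `ast` module. The particular features chosen here (tree depth, node count, number of assertions and loops) are illustrative assumptions, not the feature set used in the research discussed in this episode.

```python
import ast


def ast_depth(node):
    """Return the depth of the syntax tree rooted at `node`."""
    children = list(ast.iter_child_nodes(node))
    return 1 + max((ast_depth(child) for child in children), default=0)


def static_features(source):
    """Extract a few illustrative static features from test source code."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "ast_depth": ast_depth(tree),
        "node_count": len(nodes),
        "assert_count": sum(isinstance(n, ast.Assert) for n in nodes),
        "loop_count": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
    }


SIMPLE_TEST = "def test_simple():\n    assert 1 + 1 == 2\n"
NESTED_TEST = (
    "def test_nested():\n"
    "    for i in range(3):\n"
    "        if i:\n"
    "            assert i + 1 > 0\n"
)
```

Calling `static_features` on `NESTED_TEST` yields a deeper tree and a nonzero loop count compared with `SIMPLE_TEST`. No source code is executed, which is what makes these features static.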
Nikhil 00:31:44 Right, okay. And what you’re saying is that with a combination of these features, your machine learning program then builds a model that can predict with a reasonable amount of accuracy the areas of the program or areas of the test suite that are flaky?
Greg Kapfhammer2 00:32:01 Yeah, that’s exactly right. So the idea is you collect all of these features, and then for every test, you run the test case repeatedly in isolation, and if you notice that there’s flakiness, then that would suggest to you it’s very likely a non-order-dependent flaky test. Then if you rerun the test suite in groups or the test suite as a whole and you find certain test cases are flaky, then that’s an indicator that they’re likely to be order-dependent flaky tests. Those then are the labels that you would feed into the machine learning algorithm. So you give it information about a test’s behavior and its static structure. You give it information about the program’s behavior and its static structure. Then for every test, you give it the label from rerunning, which would indicate whether the test case is flaky or is not flaky. You then train a model, for example, a neural network or a random forest or other kinds of machine learning algorithms, and then you can use that trained model to make predictions about new test cases or revisions to test cases as they come into your test suite.
Nikhil 00:33:29 Cool, that sounds really cool, and thank you for going into that because this was new and I never thought about machine learning as a way to look at improving flaky tests. Let’s move to a slightly different topic. Obviously we have tooling that handles flaky tests after they’ve happened, and we now have the tooling you talked about that can help you predict flaky tests coming up. But how can developers avoid the introduction of flaky tests in the first place? Are there any practices or processes that they can follow to help them make sure that they’re not writing flaky tests?
Greg Kapfhammer2 00:34:11 One of the things that I tried to initially do to avoid introducing flaky test cases is always run my test suite in a random order, when I’m doing development on my laptop. I also add something into my CI configuration so that in CI, I also run my test suite in a random order. Both of those strategies are super helpful when it comes to avoiding the introduction of order dependent flaky test cases inside of my test suite.
Nikhil 00:34:45 Okay. So obviously running the test suite in a random order, there’s probably tooling to do that as well, right? You don’t have to actually consciously run them in a random order?
Greg Kapfhammer2 00:34:56 That’s right. Yeah, I often use tools that will automatically rerun the test suite in a random order. So for example, there are several tools that integrate with the Python testing tool called Pytest, and they will allow you to rerun the test suite in random orders, and you can do that, as I said before, both on your development laptop and in continuous integration.
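To see why a random order helps, here is a small self-contained sketch of an order-dependent pair of checks. It does not use any Pytest plugin; the names `shared_cache`, `writes_user`, `reads_user`, `run_in_order`, and `run_shuffled` are hypothetical and exist only for this illustration. In practice, plugins such as pytest-randomly or pytest-random-order do the shuffling for you.

```python
import random

shared_cache = {}  # state that leaks between checks (the root of the problem)


def writes_user():
    shared_cache["user"] = "alice"
    assert shared_cache["user"] == "alice"


def reads_user():
    # Hidden dependency: passes only if writes_user already ran.
    assert shared_cache.get("user") == "alice"


def run_in_order(checks):
    """Run the checks in the given order from a clean state; map name -> passed."""
    shared_cache.clear()
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = True
        except AssertionError:
            results[check.__name__] = False
    return results


def run_shuffled(checks, seed):
    """Shuffle with a recorded seed so a failing order can be replayed."""
    order = list(checks)
    random.Random(seed).shuffle(order)
    return run_in_order(order)
```

Running `run_in_order([writes_user, reads_user])` passes both checks, while the reversed order fails `reads_user`. Recording the seed, as `run_shuffled` does, is what makes a failure found under a random order reproducible.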
Nikhil 00:35:21 Okay. Is there anything that we can do in the writing of tests, any practices we can follow when writing tests, to avoid writing flaky tests?
Greg Kapfhammer2 00:35:32 One of the things that I try to do is to ensure that I write simple test cases that are easy to understand. So a goal that I set for myself whenever possible is to avoid introducing conditional logic or loops into my test suite. So when I’m writing a test suite in JUnit or Pytest or another unit testing framework, if I’m introducing loops or conditional logic and then putting assertions inside of those constructs, to me that’s a warning sign that maybe my test case could become flaky if different conditional paths or execution structures are taken when I’m running the test suite. Whenever it’s possible to do so, I also try to ensure that I have as few assertions as possible in my test suite because that will also give me more confidence in the test case’s reliability when it’s run frequently on a developer workstation or in CI.
Nikhil 00:36:35 Okay. So the key is simplicity and single responsibility: as few assertions as possible, and tests that are very focused. Is there something we can do in the way we write our actual program to make it less susceptible to flaky tests? Maybe we could write something that is easier to test in some way to help with that.
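As a rough illustration of that warning sign, compare the two checks below. The function `normalize` and both check functions are hypothetical examples, not code discussed in the episode. The first check returns a count purely so the sketch can show how many assertions actually ran.

```python
def normalize(name):
    """Function under test (hypothetical): trim whitespace and lowercase."""
    return name.strip().lower()


def check_with_loop_and_branch(names):
    """Warning sign: whether any assertion runs depends on the input data."""
    checked = 0
    for name in names:
        if name != name.strip():  # branch decides if we assert at all
            assert normalize(name) == name.strip().lower()
            checked += 1
    return checked  # returned only to make the problem visible


def check_strips_and_lowercases():
    """Preferred: one fixed input, one assertion, one execution path."""
    assert normalize("  Alice ") == "alice"
```

If the data feeding the first check varies between runs, it may exercise different paths each time, and with an empty list or already-trimmed names it passes without asserting anything. The second check always executes the same single assertion.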
Greg Kapfhammer2 00:37:02 There’s one strategy that I always think of to aid my testing process, and it has to do with the number of inputs that go into the function under test and then the number of outputs. So if I have a function that accepts a significant number of inputs but produces only a single output or no outputs, then I probably don’t have enough visibility, and so therefore the test suite might not be able to run in a fashion to ensure the correctness of the program. So partly it’s about ensuring that my functions in the program under test have a sufficient amount of visibility so that I can ensure that they’re working correctly. Another thing is connected to what you said a moment ago: just like test cases need to have a single focus, I try to ensure that the components or functions in my system also have a single focus because that makes them easier to test as well.
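One way to picture the visibility point is the contrast below. Both functions are hypothetical and use integer cents to keep the arithmetic exact. The first takes several inputs and returns nothing, so a test can only observe it indirectly. The second has a single focus and returns its result.

```python
def apply_discount_and_log(cart, percent, minimum_cents, cap_cents, log):
    """Low visibility: many inputs, no return value, effects spread around."""
    total = sum(cart)
    if total >= minimum_cents:
        discount = min(total * percent // 100, cap_cents)
        cart.append(-discount)  # result hidden inside a mutated argument
        log.append("discount applied")


def compute_discount(total_cents, percent, minimum_cents, cap_cents):
    """Higher visibility: single focus, result returned to the caller."""
    if total_cents < minimum_cents:
        return 0
    return min(total_cents * percent // 100, cap_cents)
```

A test for `compute_discount` can assert directly on the returned value for each case (below the minimum, under the cap, at the cap), whereas testing the first function means inspecting the mutated `cart` and `log` afterwards.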
Nikhil 00:38:04 Yeah. So I think that’s a good amount of practical advice and thanks Gregory for doing that. And yeah, I mean I think this has been a good episode. Do you have anything in mind that we did not cover that you’d like to talk about for flaky tests?
Greg Kapfhammer2 00:38:22 One of the things that I think we’ve not yet talked about is the isolation that you sometimes introduce into a testing process when you’re using something like a virtual machine or a Docker container. Often if you’re not able to control the state in a way that is appropriate, it can be helpful in my experience to introduce extra kinds of isolation through VMs or through containers. You can run those on your own development workstation, and you can also set it up in CI so that at various points during the testing process, you start up a fresh virtual machine or create a new Docker container instance. And even though that’s a little bit more costly, it may give you more confidence that flakiness is not going to develop when you’re executing the test suite.
Nikhil 00:39:17 Yeah, that’s a great point about the Docker container, and in today’s world of Kubernetes and Docker containers, it also makes it easier for you to try and replicate your production environment more closely, right? Because you could conceivably use the Dockerfile that specifies the amount of CPU and the amount of memory that the program would be running under in production as well. So it can help you find your flaky tests early, and potential performance issues as well, if you use the Docker container to run the test suite. That’s a great point, Gregory. Thank you. And is there anything else that you wanted to cover?
Greg Kapfhammer2 00:39:55 Well, I wanted to connect to something you said just one moment ago. One of the things that I have found super helpful is to actually rewrite my Docker containers or to have multiple Dockerfiles that do in fact change the amount of memory that’s available or alternatively that change how much of a CPU is available. I often run my test cases in different kinds of Docker containers with different types of resource constraints because when I do that, it often helps me to identify defects that are related to timing issues, resource issues, or scheduling issues. So I think the point you made a moment ago is a really important one to remember. We can use Docker containers to change the execution environment and that’s critically important.
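A minimal sketch of running the same test suite under several resource constraints might look like the following. It assumes the limits are applied through the `--memory` and `--cpus` options of `docker run`, which is one common way to do this and not necessarily the setup described in the episode. The image name `myproject-tests`, the profile values, and the helper name are hypothetical. The sketch only builds the commands and does not execute them.

```python
def docker_test_command(image, memory, cpus, test_command=("pytest",)):
    """Build a `docker run` command that runs the tests under resource limits."""
    return [
        "docker", "run", "--rm",
        f"--memory={memory}",  # cap on container memory, e.g. "256m"
        f"--cpus={cpus}",      # cap on CPU share, e.g. "0.5"
        image,
        *test_command,
    ]


# Hypothetical profiles: constrained, moderate, and generous environments.
PROFILES = [("256m", "0.5"), ("1g", "1"), ("4g", "2")]

commands = [
    docker_test_command("myproject-tests", memory, cpus)
    for memory, cpus in PROFILES
]
```

Each command could then be passed to `subprocess.run(command, check=False)` on a machine with Docker installed. A test that passes under the generous profile but fails under the constrained one is a candidate for a timing or resource-related defect.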
Nikhil 00:40:49 Nice. Yeah, so this was a great kind of an overview and I think we covered a lot of ground. I want to thank you, Gregory, for the time and for answering these questions so well and giving your experience and talking about, you know, real world scenarios. It’s quite enlightening. Thank you for your time, Gregory.
Greg Kapfhammer2 00:41:10 Thank you so much for inviting me to be a guest on Software Engineering Radio. It was awesome to be able to chat about flaky tests and I hope the listeners of this episode will follow up with you and me and check the material in the show notes to learn more about flaky test cases and some of the solutions that we’re developing both to find and fix flaky tests.
Nikhil 00:41:33 Absolutely. Yes. We will be having Gregory’s contact details, his Twitter handle, LinkedIn, et cetera, as well as links to his papers on flaky tests and other topics that he has been working on in the show notes. Thank you once again Gregory and have a good day.
Greg Kapfhammer2 00:41:51 Thank you very much as well. Have a great afternoon.
[End of Audio]