Vilhelm von Ehrenheim, co-founder and chief AI officer of QA.tech, speaks with SE Radio’s Brijesh Ammanath about autonomous testing. The discussion starts by covering the fundamentals, and how testing has evolved from manual to automated to now autonomous. Vilhelm then deep dives into the details of autonomous testing and the role of agents in autonomous testing. They consider the challenges in adopting autonomous testing, and Vilhelm describes the experiences of some clients who have made the transition. Toward the end of the show, Vilhelm describes the impact of autonomous testing on the traditional QA career and what test professionals can do to upskill.
This episode is sponsored by Fly.io.
Show Notes
Related Episodes
- SE Radio 633: Itamar Friedman on Automated Testing with Generative AI
- SE Radio 167: The History of JUnit and the Future of Testing with Kent Beck
- SE Radio 164: Agile Testing with Lisa Crispin
- SE Radio 632: Goran Petrovic on Mutation Testing at Google
- SE Radio 474: Paul Butcher on Fuzz Testing
- SE Radio 360: Pete Koomen on A/B Testing
- SE Radio 467: Kim Carter on Dynamic Application Security Testing
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Today I will be discussing autonomous testing with Vilhelm von Ehrenheim. Vilhelm is the co-founder and Chief AI Officer of QA.tech, a startup that develops autonomous agents that can interact with and test the functionality of webpages. He has over 10 years of experience in the data science and machine learning domain. Before co-founding QA.tech, Vilhelm built Motherbrain at EQT. Vilhelm has published papers in prestigious conferences such as EMNLP, KDD, and CIKM. Vilhelm, welcome to the show.
Vilhelm von Ehrenheim 00:00:54 Thank you. I’m very glad to be here.
Brijesh Ammanath 00:00:56 We’ll start with the fundamentals, if you can help by defining what is autonomous testing and how does it differ from traditional automated testing?
Vilhelm von Ehrenheim 00:01:06 Yeah, so I like to think of testing and the levels of autonomy in different stages. The first stage is manual testing, where nothing is really automated. You’re just doing everything as a human and trying to repeat the same thing again as you have done before. The next stage is where you start using automation, so scripts or different kinds of programs that can repeatedly do the same things over and over, which has been popularized by tools like Cypress, Selenium, and Playwright. Today we see more and more things that come into a new category called autonomous testing, where we raise the level of autonomy even more. So instead of it being hardcoded scripts, you focus either on self-healing, so that you don’t have to spend as much time developing and maintaining the test suites that you have, or you have fully autonomous agents that can understand and validate different kinds of objectives that you want the page to support.
Brijesh Ammanath 00:02:12 Right. Can you expand on that a bit more and maybe walk us through the evolution of software testing? How did it evolve from manual to automated and now to autonomous?
Vilhelm von Ehrenheim 00:02:24 Yeah. I think the manual side of things comes pretty naturally. When you have built something that you want to ship to a potential customer or a user, you want to make sure that it works. And this is something that I think most developers are very accustomed to. You try the different features that you have built, you click around or you interact with it in different ways to make sure that it functions the way that you have intended. The automation of that comes naturally too. When you have different ways that you want to test your software, usually you use different kinds of testing in different layers. So you have things like unit tests, testing specific snippets of code; you have integration tests, making sure that stuff works across different systems. Then you have the end-to-end tests, where you script that something is working in the browser or in the application, and kind of hard code those steps.
Vilhelm von Ehrenheim 00:03:20 So, for example, maybe you have a possibility to send an invoice in your system or do a checkout, for example. Then you script what should be filled in and you make sure that it clicks on the right buttons and then you wait and try to validate that it went through as expected. On the autonomous side, well first of all, the automated tests are pretty hard to maintain. When you think about hardcoded things in general, they’re very brittle to change. And what’s problematic with scripting something and testing that against a system that is continuously evolving and changing is that then those tests will continuously break. So, when you build a new feature or you change something in your checkout flow, then suddenly all of your tests are failing, not because it’s no longer functioning, but because they no longer do the right thing.
Vilhelm von Ehrenheim 00:04:17 So the buttons have changed or the identifiers on the page are not the same anymore, and that then requires the developer to go back to that code and also update the test suites to make sure that they adhere to the new changes that you made. On the autonomous level, we try to mitigate that by raising the abstraction one more layer. AI and Machine Learning systems are essentially designed to be able to handle a vast range of changing input parameters and still produce a reasonable answer. So essentially generalizing across a lot of different potential things that could happen, which is the same as a human would do. So for example, if I added an extra button in a step in a checkout, then I wouldn’t fail the test, because I understand that, oh, that’s a new button, and I can take a decision not to click it or interact with it in different ways and still be able to complete the checkout. And this is where AI comes in as well. If we change the application in any of the numerous ways that we normally do when we develop them, it’s then possible to let AI understand and take decisions in real time when it’s doing the testing, instead of having to rely on updating all of these tests that we have created before.
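To make that brittleness concrete, here is a minimal sketch of the kind of hardcoded end-to-end checkout test being described, written with Playwright. The URL, selectors, and field names are hypothetical; the point is that every identifier is pinned down, so a renamed button or restructured form fails the test even when the checkout itself still works.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical checkout flow: every selector is hardcoded, so any rename or
// layout change breaks the test even if the feature itself still works.
test('checkout completes with a valid card', async ({ page }) => {
  await page.goto('https://staging.example.com/cart');

  await page.click('#checkout-button');                  // breaks if the id changes
  await page.fill('input[name="email"]', 'user@example.com');
  await page.fill('input[name="card-number"]', '4242424242424242');
  await page.fill('input[name="expiry"]', '12/30');
  await page.click('button:has-text("Pay now")');        // breaks if the label changes

  // Hardcoded expectation about the confirmation page.
  await expect(page.locator('.order-confirmation')).toContainText('Thank you');
});
```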
Brijesh Ammanath 00:05:31 And that is what you meant by self-healing tests?
Vilhelm von Ehrenheim 00:05:36 Yes, exactly. There are two categories here. There are those that focus more on automatically updating the test scripts that are available, and there are those that rely more on agentic systems that take the decision in real time as they see it. So either you analyze the changes and handle them in a nicer way in your automated testing scripts, or you have an agent or an AI that can actually interpret the page and understand that we’re trying to do the checkout, we’re testing whether or not the checkout is working. Then if a new button comes up, or if it changes colors, or if it’s moving around to another location on the page, it’s still able to navigate there and perform the checkout to ensure that it works.
Brijesh Ammanath 00:06:22 We’ll deep dive into agents and how they function in a later section. But before that, I just wanted to cover a few more fundamental questions. You mentioned tools like Selenium, JUnit, Cypress that are used for automated testing. What are some of the popular tools for autonomous testing and how mature are they?
Vilhelm von Ehrenheim 00:06:43 So there are more and more tools coming out. There are some that have been around for a while that are adding more AI features on top. For example, you have tools that have had a GUI-based or low-code way of doing testing, where you have different kinds of steps, and instead of them relying on a specific identifier, you can use AI steps and things like that instead, which analyze just that specific step. Then there are tools like QA.tech, where I work, that operate at an even higher level of abstraction, where we use agents instead. And there are a few other competitors of ours and similar tools that are coming out now. I would say the maturity is still in the earlier stages when it comes to agentic-based testing, but we’re definitely moving closer and closer to that being a very mature area as well.
Brijesh Ammanath 00:07:44 Right, and are most of the tools focused on the UI layer?
Vilhelm von Ehrenheim 00:07:50 There are different tools for different layers. When it comes to unit testing, integration testing, and API testing, there has been a suite of tools for this for a long time as well. Using AI to help you write good unit tests, for example, is something that has naturally been evolving as AI gets better and better at coding. We also see some tools that are evolving around API testing and building smarter features around that as well. But where I think the testing community has been struggling the most is in end-to-end testing, where we test a full application where everything comes together, which is also where sometimes really strange things can happen. Because when you test something, you should test something very isolated and specific, so you understand what’s functioning and what’s not for that component. But when you put all of those components together is where things get very hard to test and also much more prone to errors.
Brijesh Ammanath 00:08:54 And are there testing scenarios where manual testing still plays a critical role?
Vilhelm von Ehrenheim 00:09:00 Yes, I think so. I think we’ll see more and more evolution in this space, but there is a concept of exploratory testing that is still pretty new when it comes to AI features. There are features where an AI can try different things and interact with the application in a less planned manner, but I think there’s still definitely a space for humans, both on the exploratory side and also to ensure the entire quality process of your development. We can talk more about exploratory testing, but in general it’s when you take a certain area and explore the edges and surface boundaries around that feature, and what could potentially change if you change states and land in different ways, which I definitely still think needs a human to be involved.
Brijesh Ammanath 00:09:51 Right. Do you have any examples of stories of where companies have transitioned and implemented autonomous testing? What was their journey like?
Vilhelm von Ehrenheim 00:10:00 Yeah, so we have a number of clients that have been transitioning more into autonomous testing, and you see some different categories in there. One of our customers had both a suite of automated tests and a team of QA engineers that were both building these and doing manual testing. They had to let go of some of their engineers during the financial crisis, so they ended up with a much smaller team. And that team then, over the years, had a very hard time keeping up with the need for all the testing that they had. So they partnered with us and were able to speed up their team so that their team could do more interesting things and not only these repetitive tasks in the browser, and managed to reduce their testing costs drastically.
Brijesh Ammanath 00:10:57 How does a human prove that AI-generated tests work?
Vilhelm von Ehrenheim 00:11:01 So this is slightly different depending on the tool, I would say, but in general, when you look at QA.tech as a tool, we try to be as transparent about the AI’s reasoning and its execution path as possible. As a human, when you get a bug report or something similar from either a manual tester or an automated script, what’s important in order to be able to debug and deep dive into those issues is to have a full recording of the session and as much data as possible from the session. Ideally a recording, which is not always the case, especially with manual testing, and even in automated testing it’s sometimes hard to do. So I think it’s equally important in autonomous testing to be very transparent about what’s happening and show as much of the underlying functionality on the page as possible. So we show the full recording of the session with all the steps and the reasoning that the agent has taken at all of the different steps. You can also see console logs, network logs, and all of these different things that could potentially show why something has happened and make it easier for an engineer to reproduce and then subsequently fix.
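As an illustration of the kind of session evidence described here (recordings, console logs, network logs), the following is a small, vendor-neutral Playwright sketch that records a trace and collects console and network events during a run. It is not QA.tech’s implementation, and the target URL is hypothetical.

```typescript
import { chromium } from 'playwright';

// Minimal sketch: capture a trace plus console and network logs so a
// failure can be replayed and debugged later.
async function runWithEvidence() {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  await context.tracing.start({ screenshots: true, snapshots: true });

  const page = await context.newPage();
  page.on('console', (msg) => console.log(`[console] ${msg.type()}: ${msg.text()}`));
  page.on('request', (req) => console.log(`[network] ${req.method()} ${req.url()}`));
  page.on('response', (res) => console.log(`[network] ${res.status()} ${res.url()}`));

  await page.goto('https://staging.example.com'); // hypothetical target

  // ... perform test steps here ...

  await context.tracing.stop({ path: 'trace.zip' }); // inspect with: npx playwright show-trace trace.zip
  await browser.close();
}

runWithEvidence();
```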
Brijesh Ammanath 00:12:21 Right. Coming back to the example where the company transitioned over to autonomous testing, the constraint there was primarily lack of capacity in the testing team. But are there prerequisites that a company should consider before adopting autonomous testing? Whether that’s from a technology front, does it work with legacy technology, or from an infrastructure perspective, is there a minimum bar that needs to be met before you consider autonomous testing?
Vilhelm von Ehrenheim 00:12:51 In general, you need to have some kind of environment for the agents to run in. So from the infrastructure side, you need to have an isolated environment that the agent can work in, that is reachable, and that has the functionality that you want to test. For example, if you only develop on your local machine and then deploy to production, then maybe there isn’t really an environment that you could test things in. I would say that that is pretty bad practice in general, but if you do, then it might be hard to use most of those autonomous testing tools, and you would have to look for something that could run locally on your machine instead. Another prerequisite in general is that I think you need to have some kind of problem to start with.
Vilhelm von Ehrenheim 00:13:38 Either you don’t really have any testing and you would like to upgrade your testing and start building out a test suite so that you can ensure that your functionality works as it should, or you have a large suite of automated tests and you have a problem maintaining them and it takes a lot of time and effort. Then you have that specific thing in mind to start optimizing against. Or you have manual QA and it’s expensive or they have a hard time keeping up. So I think you should look at those different cases slightly differently. If you have a problem, then you know what you’re trying to achieve by implementing another solution, and then you can track those metrics and see that you’re actually succeeding with your implementation.
Brijesh Ammanath 00:14:22 Right. And from a technology perspective, are there constraints? To give an example, does autonomous testing only work on software which has APIs? If you have a legacy technology which does not provide an API, would autonomous testing still work there?
Vilhelm von Ehrenheim 00:14:40 Our solution would work with it as long as it’s browser based. I think there are new tools coming out for native applications as well, but currently we only support browser-based applications. So you need to have some kind of URL where the page is deployed and you can let the agent interact with the page. When it comes to APIs, if you only have a specific set of APIs, I think you should look at other tools for that specific thing. There are a lot of different ways to automate and test APIs in a good way. I would say you could definitely use AI for that today, in order to help discover different kinds of potential inputs for the API so that you can cover a larger set of the potential surface area.
Brijesh Ammanath 00:15:29 We’ll move on to the next section, which is we’ll deep dive into autonomous testing and we’ll also explore the role of agents. Let’s start off maybe if you can explain the essential components of an autonomous testing system.
Vilhelm von Ehrenheim 00:15:41 Yeah, so we’ve focused on agentic systems specifically. Normally, the different components you would have to have in such a system are, as with all agents, some kind of core processing model that can look at input and produce output for you. Then you need a way for that system to be able to observe the environment, which in this case is the browser. So you would have to have some kind of browser component that can feed back information about the page that is relevant for the agent to take good decisions. Then you usually also need some kind of memory component that can record and store information that is relevant for the agent to keep as it continues along its trajectory. And still, I would say, today you usually need some kind of planning component as well. You could either do that with AI, or you can do other kinds of smarter planning solutions using machine learning or search. But in general, in order to be able to produce a good test trajectory, you need to have information about the system that you’re running against, so that it can plan a potential test execution beforehand and then evolve through that with the agent.
Brijesh Ammanath 00:17:05 Right. So to summarize, you need at least four components. You have the core processing model, which is the AI model. You have a component which observes the environment. You have the memory component; I believe that’s where the self-healing and learning will happen. And then you have the planning, where the test trajectory or the test cases are planned.
Vilhelm von Ehrenheim 00:17:26 Yes, exactly. I also forgot to mention there that you need to be able to execute things in the environment, otherwise nothing is going to happen. So you also need the possibility to let the core processing execute commands in the browser.
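To tie these components together, here is a minimal, vendor-neutral TypeScript sketch of an agent loop with observation, a core decision model, memory, planning folded into the decision step, and execution. The interfaces and the decideNextAction call are assumptions for illustration, not QA.tech’s actual architecture.

```typescript
// Vendor-neutral sketch of an agentic test loop: observe the browser,
// let a core model decide the next step, execute it, and remember what happened.
interface Observation { url: string; domSummary: string; screenshotB64?: string }
interface Action { kind: 'click' | 'fill' | 'navigate' | 'finish'; target?: string; value?: string }
interface StepRecord { observation: Observation; action: Action; reasoning: string }

interface Components {
  observe(): Promise<Observation>;                                   // browser component
  decideNextAction(goal: string, obs: Observation,
                   history: StepRecord[]): Promise<{ action: Action; reasoning: string }>; // core model + planning
  execute(action: Action): Promise<void>;                            // execution in the browser
}

async function runObjective(goal: string, c: Components, maxSteps = 25): Promise<StepRecord[]> {
  const memory: StepRecord[] = [];                                   // memory component
  for (let step = 0; step < maxSteps; step++) {
    const observation = await c.observe();
    const { action, reasoning } = await c.decideNextAction(goal, observation, memory);
    memory.push({ observation, action, reasoning });
    if (action.kind === 'finish') break;                             // the model judges the objective complete or failed
    await c.execute(action);
  }
  return memory;                                                     // full trajectory for reporting and debugging
}
```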
Brijesh Ammanath 00:17:41 You also mentioned that you were going to focus on agent systems for autonomous testing. What are the other types of autonomous testing systems?
Vilhelm von Ehrenheim 00:17:50 I briefly mentioned earlier that there are a few different solutions that focus more on self-healing of the hardcoded tests. So you would, for example, have a hardcoded test run, and then you would have a recorded session of how that looks, and then you can use AI models to self-heal if something has changed which is not necessarily a real error. So, in the case of the checkout, if you added another step, then instead of it being broken and you having to fix it manually, you would have an AI assist you in fixing that automatically and analyzing whether or not it was a reasonable failure. And then there are also other tools. There’s a newer startup called Meticulous that does something similar, where they instead analyze changes in the rendering of the different pages. So they look at smaller changes and try to analyze whether or not that is intended. And then you also have another category that primarily focuses on analyzing user sessions. So if you record user sessions, as is commonplace in different product discovery tools like FullStory or similar, then you could look at those sessions and identify whether a user has actually encountered a bug as well. So letting AI look at the session and say, oh, there was an error there.
Brijesh Ammanath 00:19:30 What are AI agents in the context of autonomous testing?
Vilhelm von Ehrenheim 00:19:35 So AI agents are a way for us to be able to execute tests and analyze and take decisions as they move. Take the examples from before: if something unexpected happens, then a normal test would not have the possibility to recover from that, whereas an AI agent has the possibility to both observe and take decisions and take actions in the system, right? So if something unexpected happens, like something is missing or something has moved, then it has the possibility to analyze whether that was an intended change or whether it was actually some kind of bug. And it also has the possibility to take decisions on what to do next without necessarily knowing upfront what it was supposed to do. So in the case where we added a new step to the checkout, it can identify, oh, there is this new step here in the checkout that I need to fill in, maybe I need to fill in more information about the user or something. And then it can fill in that information and move onwards with the test as if it was programmed to do that from the start, since it’s much better at taking decisions and understanding different kinds of contexts.
Brijesh Ammanath 00:20:59 Can you walk us through a specific instance where an AI agent adapted to a changing application without needing manual intervention? Do you have any examples around such use cases?
Vilhelm von Ehrenheim 00:21:11 We have seen so many examples of this, which is pretty cool. It’s one of the things that I think is the most fun to see when you develop agents. For one instance, we usually have a suite of login tests, and then when you come into the application, we have more tests in there. Those login tests are usually the dependencies for the rest. And we had one of our customers who completely changed their login provider. So instead of the username-password login that they had hardcoded themselves, they used a third-party one to support more different kinds of login scenarios. And our agents were able to run this completely transparently without any kind of problems at all. We had another example where one of our customers has a management system for e-commerce stores, where you have the possibility to create warehouses and change stock in there and so forth. And there were some tests where it should continue to configure the warehouse, and then for some reason, something else had happened where they had reset the database, so there was no data to actually run the tests on. Instead of failing the test there, the agent actually went in and created a new warehouse and then continued onwards to configure it, and concluded that the configuration was still functioning as it should.
Brijesh Ammanath 00:22:33 Very interesting. How do agents decide which tests to run and what areas of the application to focus on?
Vilhelm von Ehrenheim 00:22:40 This I think is a very interesting area, and we have decided to focus on analyzing the page and understanding the different components that exist in an application, and then asking the user whether or not they want to test those things. So in our platform, you have the possibility to create high-level objectives, like in the warehouse case, it should be possible to configure the warehouse, or in the checkout, it should be possible to purchase something and check out of the system. And then when we execute those, we discover more and more data about those things, and then we can suggest new kinds of test cases that you could potentially add to your regression suite. That could be, for example, it should be possible to add a cart to some kind of favorites list or save it for later, or it should be possible to delete the warehouse. And then as we run more things and we have analyzed the page more and more, we come up with more and more things like this that could be interesting for you to test and add to your test suite. But we still rely on humans to actually take that final decision on whether or not they want to run those things.
Brijesh Ammanath 00:24:00 And how do the agents handle test data generation?
Vilhelm von Ehrenheim 00:24:04 AI in general is really good at coming up with things. If you ask ChatGPT to write you a poem, it’ll do a brilliant job. And we see the same thing when it comes to data generation for different scenarios, even when it’s very specific. So you have things like this warehouse that I mentioned, where you have to come up with a lot of different configuration options and very specific details for their specific system. But if you give an AI enough context about what it’s looking at and what you want to get out of it, so, I want to generate example data for this form, this page does this and that, these are the warehouses and they have all of these different configurations and so forth, it will be very good at coming up with interesting test data to use for those scenarios.
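As a rough illustration of how context-driven data generation could look, the sketch below asks a general-purpose LLM (via the OpenAI Node SDK) to produce form data given a description of a warehouse-configuration page. The prompt, fields, and model choice are invented for the example; this is not QA.tech’s data-generation pipeline.

```typescript
import OpenAI from 'openai';

// Hypothetical example: generate varied test data for a warehouse-configuration
// form by giving the model context about the page and the expected fields.
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function generateWarehouseTestData(): Promise<unknown> {
  const completion = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'You generate realistic, varied test data as strict JSON for web form testing.' },
      {
        role: 'user',
        content:
          'The page configures a warehouse in an e-commerce admin tool. ' +
          'Fields: name (string), city (string), capacityUnits (integer 100-100000), ' +
          'temperatureControlled (boolean), openingHours (e.g. "08:00-18:00"). ' +
          'Return a JSON object with a "configurations" key holding an array of 3 diverse examples.',
      },
    ],
    response_format: { type: 'json_object' },
  });
  return JSON.parse(completion.choices[0].message.content ?? '{}');
}

generateWarehouseTestData().then((data) => console.log(data));
```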
Brijesh Ammanath 00:24:57 Right. And also, is there a risk of bias creeping in because AI is generating the data?
Vilhelm von Ehrenheim 00:25:04 Yes, there is definitely a risk for bias in general when it comes to AI-generated content. I think what you then need to be very mindful about is to help it get the right context that would make sense for your application and the different things that you would like it to think about. But there is definitely always a risk of it, say, only generating names from a certain kind of Western country or something, and not thinking about the different cultures of the people that could potentially be using the platform.
Brijesh Ammanath 00:25:38 And what does bias mean from an autonomous testing perspective? Does it mean that certain test cases are completely excluded and not run and hence there are gaps in the testing?
Vilhelm von Ehrenheim 00:25:48 There’s definitely a risk for that. In general, I think you have the same risk with humans running testing as well, in that they have a specific mentality in how they run different tests or how they test different applications. Maybe one QA tester is much more interested in testing SQL injections, whereas another one is much more interested in manipulating the state of the application. In general, I think we haven’t seen too many problems with that when it comes to testing from higher-level objectives, especially when you focus on, I want this specific warehouse functionality to work in this and that way and make sure that it fulfills these things. But there is always a risk of it not thinking of some specific thing and doing the same tests over and over again in a more biased manner. Still, I think it opens up more variation and possibilities to vary and test the application more closely to how a user actually experiences your application, compared to normal test automation where you hard code specific steps; even if you generate data for that, or come up with a lot of different data, it’s still very much more limited.
Brijesh Ammanath 00:27:08 Right. I was just comparing that to a normal, say, manual plus automated testing combination. If you had a tester who was focused primarily on SQL injection, you would ideally have a test plan which would ensure that all these areas are covered.
Vilhelm von Ehrenheim 00:27:25 Yes, exactly.
Brijesh Ammanath 00:27:27 Whereas in autonomous testing, I’m assuming the test plan itself is prepared by AI. So what approach do you take to identify the biases and the misses, the areas that have been missed from testing?
Vilhelm von Ehrenheim 00:27:40 Yeah, we rely on test plans as well. So, essentially what we help the user with is kind of coming up with different test plans and then executing those test plans, but you still have the possibility to work with your test plans. So if you specify that something should function in a certain way and that you expect it to load within a certain amount of time and that it should be possible to do something else after that, say after a checkout for example, you should get an email. As long as you have those things specified in your plan, I think you can definitely be very confident that the AI will do the same thing. But of course, having a complete coverage of your entire application and thinking of all the different ways that you would want to test it is a challenging subject.
Brijesh Ammanath 00:28:27 What techniques are used to optimize test execution?
Vilhelm von Ehrenheim 00:28:32 There are different ways that you can optimize the execution when it comes to agents. The first thing is to collect more context for them so that they can understand and execute things in a smarter way. Then the other thing you could work on is the planning component; understanding and doing things with a better plan usually increases performance quite a bit. Then there is also the possibility to do different kinds of fine-tuning. So for example, if you have a very specific application and nothing similar has been seen in the training data of the larger models that are taking decisions, then they might perform really, really badly. And if you then collect data on those things and train your agent on it, it could be that you train different components, or that you train the main execution engine or model, to be better able to analyze and take good decisions in that environment.
Brijesh Ammanath 00:29:33 Right. Maybe what would make it a bit clearer is to think of it from an interventions perspective. So if you have autonomous testing implemented, what are the various interventions where you need the test team to help, whether in setting it up, collecting the context, planning, or fine-tuning? Where do you need intervention, some actual human doing something, to make sure the autonomous testing is working as expected?
Vilhelm von Ehrenheim 00:30:11 In general, when it comes to AI systems, and I think this applies here as well, you need to observe some data in order to see whether you have false positives or false negatives. If you do encounter a false positive, say for example it was not possible to check out but the agent completed the test anyway, or if you have the other type of failure where the agent struggled with something and said that it wasn’t possible to do the checkout even though it’s a functioning feature, what you would have to do is report those. It slightly depends on the tool that you’re using and how you can actually do this, but in general, what you need to do is feed back those issues, and that then makes it possible for the agent to learn.
Vilhelm von Ehrenheim 00:31:01 So that could either be using different mechanisms like Reflection, which is a way for the agent to kind of analyze positive and negative parts and kind of come up with a better way to think about the problem. Or it could be through fine tuning where you could actually use those as labels. When it comes to kind of reinforcement learning in general, you also have the possibility to do training with verifiable tasks. So if you have a system where you have the possibility to know whether or not it actually succeeded from an outside kind of perspective, then you could use that data as well to train the agents to come up with better planning or better execution strategies.
Brijesh Ammanath 00:31:43 Right. What are some of the biggest challenges in training AI agents?
Vilhelm von Ehrenheim 00:31:48 I think it is pretty new tech in general. The evolution in the AI field has been super rapid over the last few years, but it has still taken quite a bit of time before we’ve seen agents actually being first-class citizens in the training. Today you have tools like Anthropic’s computer use and OpenAI’s Operator and different agent frameworks, which has put a lot more emphasis on training these models on inputs that are interface based. When we look a few years back, large language models were not trained on this. They were primarily trained on text conversations, and when they started to be multimodal, they were mostly trained on different kinds of images of the real world and not so much on interfaces. And we had a lot of struggles in the beginning where those models were struggling a lot to identify things that a human finds very simple in an interface, like, for example, whether the page is in dark mode or light mode, where different buttons are, whether you should hover or click things, and all of these things.
Vilhelm von Ehrenheim 00:33:02 But it has become a lot better over the last year, I would say.
Brijesh Ammanath 00:33:06 Have you come across cases where agents failed or struggled to execute a test properly? What was the root cause and how was it fixed?
Vilhelm von Ehrenheim 00:33:16 Usually that boils down to the agent not having context, or it being a complicated flow in some way such that it’s hard for the agent to understand what it’s supposed to do. It sometimes can be simple things, but most of the time it’s when the flow is very long and complicated and it needs some kind of knowledge that’s hard for the agent to have in its context. But we have seen things where, in the earlier days, it struggled a lot with even simpler things, like a certain date picker being implemented in a weird way or things like that. How we overcome that in general is to identify problematic areas and then try to collect more data on those and improve our learning setup in order to teach the agent how to solve those things. Sometimes it can also be issues with how the browser is interpreting things and how we translate things from the browser to the agent. So there are those kinds of areas that you might need to improve as well.
Brijesh Ammanath 00:34:25 Right. And are there common integration pitfalls that teams should watch out for? Do you have any stories or examples around such cases?
Vilhelm von Ehrenheim 00:34:37 You mean in general like integration tools?
Brijesh Ammanath 00:34:39 When you’ve integrated autonomous testing to an existing, you know, test flow?
Vilhelm von Ehrenheim 00:34:44 I think where we have seen the agent struggle is where you have very complicated applications. So some of the things that we have had a hard time testing is when you, for example, have a very complex management system that affects another system and you want to ensure that those things are happening at the same time, which is hard to do in autonomous testing as well. And I think if the system is very complicated and hard to understand for a human, it’ll be even harder for an agent. So I think that’s still the case even though we’re moving rapidly forward. So if it is a very complicated, hard application that is hard to come up with tests for, I think it might be hard for the agent to succeed.
Brijesh Ammanath 00:35:38 Right. We’ll move to the next section where I want to discuss transitioning to autonomous testing. So how should teams approach integrating autonomous testing into their existing workflows?
Vilhelm von Ehrenheim 00:35:50 I think we see a lot of evolution in development workflows in general today, where more and more AI tools are coming in, and I think you should approach it with some kind of curiosity at first. So even if you have a rigorous suite of automated tests and you have a good team of QA engineers, there is still a possibility for you to level up your testing and improve its coverage. What you could potentially do is start out with a certain subset of things that maybe your QA team doesn’t want to focus on as much. Maybe you have some set of smoke tests or something that you want to run on each deploy that takes a lot of time to maintain, for example; then maybe you should try to use AI to solve some of those issues.
Vilhelm von Ehrenheim 00:36:38 In general, my recommendation would be to just give it a go and see if it suits your workflow. I think we’ll move more and more into a development workflow that has a lot of AI-assisted components. So if you, for example, are curious about code generation and using agents for solving bugs in your ticketing system, then I think that complements really well with using AI-assisted testing as well, because then you can discover different kinds of bugs, which is actually very hard for code-based agents or code solutions to do. And then you can complement that with a testing solution that reports interesting issues and problems into your ticketing system, and then use modern AI coding solutions to fix some of those issues.
Brijesh Ammanath 00:37:35 Can you tell me about the key technical and organizational challenges any client of yours has faced in adopting autonomous testing?
Vilhelm von Ehrenheim 00:37:44 I think we have had some clients that have had technical challenges when it comes to how they run their development flows. For example, if you only have feature branches and no specific QA environment, then that’s definitely been problematic. We have had some customers instead run these agents as a kind of monitoring solution on their production environment. So that’s possible. But I would say it’s still more optimized towards running and being able to report issues in your staging environment before you actually do the deployment.
Brijesh Ammanath 00:38:23 And any organizational challenges come to mind?
Vilhelm von Ehrenheim 00:38:26 So organizationally, I would say that the teams that struggle the most are those that don’t really have any kind of testing efforts already, because then you need to come up with what you actually want to do and how you structure things. For the teams where you already have maybe one or two QAs, or you have engineers that are used to automated testing, it’s usually well received because they have a similar setup already. We do see some QA teams being a little bit hesitant to take in automated solutions. I think it’s partly because they feel like they’re being replaced, but I would much rather think of it as a way to get superpowers. It’s the same with analytics tools or forecasting suites for financial departments: it’s not necessarily replacing those that do it by hand, it just makes them so much smarter and better, and I think you could do the same in QA testing. So if you have the possibility to hand off some of those repetitive things that you have to do all the time to an AI agent, that makes it much easier for you to focus on other things and move faster.
Brijesh Ammanath 00:39:41 And do you have any examples where any particular client could not successfully implement autonomous testing?
Vilhelm von Ehrenheim 00:39:48 The examples that I have seen have been either because of their implementation, so for example, some systems have very complicated authentication flows that only work with the KYC solution in that country, or you have some kind of technical limitation on how you can run things, and maybe your data is very sensitive even in your staging environment, for example. And we have had some customers that have had to step away from using it because they didn’t want to share that data with an external provider.
Brijesh Ammanath 00:40:22 When you say technical limitations around data, are you referring to masked data?
Vilhelm von Ehrenheim 00:40:27 Yeah, so it depends on the way that you structure things. Usually what we would recommend is to have some kind of staging environment where you don’t have sensitive data, but some systems have a hard time having test data in that kind of environment, where there is data in the staging environment that you haven’t masked or obfuscated or changed in some way. And if you have a regulatory requirement to not share that data, then it’s hard to use tools that analyze and execute on that data.
Brijesh Ammanath 00:41:06 Right. And what metrics should teams track to measure the success of autonomous testing?
Vilhelm von Ehrenheim 00:41:12 I think there are several interesting metrics that you could track. First of all, when it comes to test execution time, you could measure how long it takes to run your entire test suite. If you have a lot of manual work in there, it usually takes quite a long time, so if you have the possibility to remove some of that time to make your cycles shorter, then that’s an interesting metric to measure. If you have more of an automated testing suite, it’s also interesting to analyze the ratio of flaky tests. If you implement more autonomous solutions, you usually see some kind of reduction in the flaky-test ratio. Then of course the holy grail is the bug-detection rate: how many bugs do you actually discover and prevent using the different kinds of testing suites. I think all of those are super interesting to track.
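As a small illustration of what tracking those three metrics could look like, here is a sketch that computes suite duration, flaky-test ratio, and a bug-detection figure from per-test run records. The record shape and field names are invented for the example.

```typescript
// Invented record shape for illustration: one entry per test per run.
interface TestRunRecord {
  testId: string;
  passed: boolean;
  durationMs: number;
  retriedAndPassed: boolean; // failed first, then passed on retry: a flakiness signal
  confirmedBug: boolean;     // failure later confirmed as a real defect
}

function suiteMetrics(records: TestRunRecord[]) {
  const totalDurationMs = records.reduce((sum, r) => sum + r.durationMs, 0);
  const flaky = records.filter((r) => r.retriedAndPassed).length;
  const failures = records.filter((r) => !r.passed).length;
  const realBugs = records.filter((r) => r.confirmedBug).length;
  return {
    totalDurationMinutes: totalDurationMs / 60_000,                 // test execution time
    flakyRatio: records.length ? flaky / records.length : 0,        // flaky-test ratio
    bugsDetected: realBugs,                                         // bug-detection count for this run
    failuresThatWereRealBugs: failures ? realBugs / failures : 0,   // signal quality of the suite
  };
}
```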
Brijesh Ammanath 00:42:13 Okay. Before we move to the next section, if you can just quickly explain to our listeners what are flaky tests?
Vilhelm von Ehrenheim 00:42:19 Yeah, so flaky tests are, in general, tests that fail every now and then, intermittently, without there actually being a real failure. Usually in automated testing, that could be because there is a network hiccup, or the page takes slightly longer to load than what is coded in. There are a lot of different reasons that could actually be driving a flaky test, but you usually see them a lot, especially in end-to-end testing, because the environment has so many different variables in different states. So it’s usually very hard to hard code a test that takes all of those things into account. That’s also why autonomous testing can help you reduce those.
Brijesh Ammanath 00:43:07 Right. So in my mind’s eye, if I was scripting it out in an automated fashion, I would say the page could take X amount of time, so I would put in a wait condition, and sometimes that wait condition could be more than required or less than required. Whereas in an autonomous fashion, the AI agent would decide how long to wait, maybe factoring in the network speed and other considerations.
Vilhelm von Ehrenheim 00:43:36 Exactly. Our agent, at least, also gets visual cues on the page. So if there is a loader that is still spinning, then it understands that it should wait a little bit longer. Of course, if it takes too long and it times out, then it will see that, which is a real failure. But if it’s just something that took a little bit longer than usual, then it handles that gracefully.
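To make the wait-condition point concrete, here is a small Playwright contrast between a hardcoded sleep, which is a common source of flakiness, and condition-based waits that tolerate variable load times. The page URL and selectors are hypothetical.

```typescript
import { test, expect } from '@playwright/test';

test('order history loads', async ({ page }) => {
  await page.goto('https://staging.example.com/orders'); // hypothetical page

  // Brittle: a fixed sleep is either too long (slow suite) or too short (flaky failure).
  // await page.waitForTimeout(3000);

  // Better: wait on observable conditions. The spinner disappearing and the
  // table appearing describe what "loaded" actually means.
  await expect(page.locator('.loading-spinner')).toBeHidden({ timeout: 30_000 });
  await expect(page.locator('table.orders-list')).toBeVisible();
});
```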
Brijesh Ammanath 00:44:01 Makes sense. We’ll move on to the next section, which is the human element in autonomous testing. So in real world implementations, how has autonomous testing changed the role of QA or test engineers?
Vilhelm von Ehrenheim 00:44:15 So I think when you are a manual QA tester, you usually have a lot of different things that you would like to test in the platform, and that can be very manual. Every time you do a larger suite of tests, you have to do the same thing over and over again. So that definitely changes, right? I think you would be more like a QA manager, in a way, who actually comes up with interesting test plans and makes sure that the AI is executing all of those things, and works together with the AI to come up with different test strategies to improve your coverage and so forth. And when it comes to the engineers that normally develop all of the automated tests, they usually welcome this improvement, because they need to spend a lot less time on writing and fixing, primarily maintaining, those kinds of flaky tests that break over time when your application changes.
Brijesh Ammanath 00:45:12 Have companies found that testers need to upskill to work alongside AI-driven testing tools? What specific skills are required and how can testers go about upskilling themselves?
Vilhelm von Ehrenheim 00:45:26 I don’t think there is any specific need in upskilling per se, but there is of course a need for upskilling in understanding what tools are available today and how you could potentially use modern tools in your workflows. So reading about new tools and testing them out I think is a good way to kind of make sure that you stay on top and kind of plan to be part of a more modern development cycle going forward.
Brijesh Ammanath 00:45:55 All right, can you share any examples where testers have taken on new responsibilities as autonomous testing was implemented in their company?
Vilhelm von Ehrenheim 00:46:04 In the example I mentioned at the beginning, where a larger team of QA engineers was downscaled, those that were left in the team needed to spend a lot of time on both manual testing and scripting up automated tests, and they have seen quite a large difference in the time that they need to spend on these things. So they can focus more on planning out and working with the suite and understanding how to test the application in a better way, and not spend so much time on maintenance and repeatable tasks.
Brijesh Ammanath 00:46:39 Right. So they move from maintaining test suites to doing more activities like planning and looking at edge cases or exploratory testing.
Vilhelm von Ehrenheim 00:46:51 Yeah, exactly. And working together with the development teams to make sure that quality is kind of part of the entire development cycle.
Brijesh Ammanath 00:46:59 From the discussion we have had, I’m coming away with the opinion that it’s not truly autonomous. We still have a journey to progress towards true autonomy, so it’s more like semi-autonomous. Would you agree with that statement?
Vilhelm von Ehrenheim 00:47:14 I would agree with that statement. I think AI is not really there yet, functioning on the same level as a human on these tasks. There’s a lot of work that can be offloaded to AI, but I still think you need a human to understand and think about how to run your test suites and how to make sure that the application is actually functioning as you intend it to.
Brijesh Ammanath 00:47:41 All right. Well, that’s been an incredibly insightful discussion on autonomous testing. Before we wrap up, do you have any final thoughts or advice for teams looking to adopt autonomous testing?
Vilhelm von Ehrenheim 00:47:54 I think you should embrace the new changes that are happening with open arms. I think there is a lot of fear and skepticism around AI development today, but I think we’ll see large improvements over the coming years, and I think it will be a superpower to have these AI-capable tools in your tool belt. Whether you’re a normal developer, a QA engineer, or a QA tester, I think it will be a bright future. So stay curious and continue to learn.
Brijesh Ammanath 00:48:30 Thank you Vilhelm for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath for Software Engineering Radio. Thank you for listening.
Vilhelm von Ehrenheim Thank you.
[End of Audio]