
SE Radio 633: Itamar Friedman on Automated Testing with Generative AI

Itamar Friedman, the CEO and co-founder of CodiumAI, speaks with host Gregory M. Kapfhammer about how to use generative AI techniques to support automated software testing. Their discussion centers around the design and use of Cover-Agent, an open-source implementation of the automated test augmentation tool described in the Foundations of Software Engineering (FSE) paper entitled “Automated Unit Test Improvement using Large Language Models at Meta” by Alshahwan et al. The episode explores how large language models (LLMs) can aid testers by automatically generating test cases that increase the code coverage of an existing test suite. They also investigate other automated testing topics, including how Cover-Agent compares to different LLM-based tools and the strengths and weaknesses of using LLM-based approaches in software testing.



Show Notes

Related Episodes

Research Papers

Blog Posts

Software Tools


Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Gregory Kapfhammer 00:00:18 Welcome to Software Engineering Radio. I’m your host Gregory Kapfhammer. Today’s guest is Itamar Friedman. He’s the CEO and Co-founder of CodiumAI. Itamar and his team develop a whole bunch of cool automated testing tools, and they use generative AI techniques and large language models. Itamar, welcome to the show.

Itamar Friedman 00:00:40 Hey, very nice being here. Thank you so much for inviting me.

Gregory Kapfhammer 00:00:44 We’re glad to talk to you today about an open-source tool that you’ve developed at CodiumAI called Cover Agent. Cover Agent is based on a paper that was published at the Foundations of Software Engineering Conference. Let’s start by exploring some of the features that Cover Agent provides and how it works. Are you ready to dive in?

Itamar Friedman 00:01:03 Yeah, of course.

Gregory Kapfhammer 00:01:04 Okay, so at the start I’m going to read a quotation that comes from the documentation for the Cover Agent tool and then we’re going to use this sentence to explore more about how the tool works. So the sentence is as follows, it says “CodiumAI Cover Agent aims to efficiently increase code coverage by automatically generating qualified tests to enhance existing test suites.” So let’s start at a high level. Can you explain what are the testing tasks that Cover Agent automatically performs?

Itamar Friedman 00:01:37 Yeah, so I guess you’re mainly referring to types of testing: is it unit testing, component testing, integration testing, etc. So basically, I think Cover Agent can try to generate all of these, but the sweet spot that we saw is mostly around component testing. If you provide an initial few tests that cover one or more components and you run Cover Agent, it’ll try to generate many more. Actually, it’ll try to generate as many as it can until reaching your criteria, for example, until a certain threshold is met or a certain number of iterations that you define, because you don’t want to run more than a few iterations since there’s also a cost that comes with it. Basically, what Cover Agent does is take the first few tests that were given as part of the test suite and use them as inspiration to generate more. So if you come with integration tests or end-to-end tests of different types, it could be end-to-end tests in Playwright or Cypress, etc., it’ll try to mimic those. If you come with a few simple component tests, it’ll try to mimic those, etc. So you can aim to generate any type of testing. Having said that, it works best for component testing at the moment, with the current implementation as of July 2024.

Gregory Kapfhammer 00:03:02 So you said it works best with component testing. When you say component, is that equivalent to talking about unit testing?

Itamar Friedman 00:03:09 Yes. The reason I prefer the term “component” is because I think it makes it clear that you can do more than four lines of code, you know, more than what you expect from a simple method or something like that, if you’re doing clean code, of course. If you have, for example, a class that could have 500 lines of code with, just as an example, five methods, actually I think you will be happier, more satisfied, running Cover Agent on that, because it’s within the limit of Cover Agent from what we see technically and from experimenting with it empirically to generate good tests for that kind of setup. And I think that provides more value. Of course, it’s a matter of taste and also situation. Some people have a bigger appetite for method-specific unit tests because, for different reasons, they want them to be fast, they want to verify that each specific method is being highly tested, or some would prefer going one level above, which is a component. In many cases you gain a lot of the properties of unit tests; they’re still fast, they’re still independent and things like that, but you have more opportunities for interesting, you know, behaviors that capture more of what the software is supposed to do. So yes, I believe that component testing is the way to go and Cover Agent covers that quite nicely. So that’s why it’s our recommendation, but you can go with unit tests as well, considering our terminology, of course.
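To make the distinction concrete, here is a minimal, hypothetical pytest sketch; the `Cart` class and both tests are invented for illustration and do not come from the episode. The first test exercises one method in isolation, while the second exercises several methods working together, which is the component-level style described above.

```python
class Cart:
    """Hypothetical component under test: a small class with a few methods."""

    def __init__(self):
        self.items = {}   # name -> quantity
        self.prices = {}  # name -> unit price

    def add(self, name, price, qty=1):
        self.items[name] = self.items.get(name, 0) + qty
        self.prices[name] = price

    def total(self):
        return sum(self.prices[n] * q for n, q in self.items.items())


# Method-level unit test: exercises a single method in isolation.
def test_add_single_item():
    cart = Cart()
    cart.add("apple", 2.0)
    assert cart.items["apple"] == 1


# Component-level test: several methods working together, the "sweet spot"
# described above; still fast and independent, but covers more behavior.
def test_total_across_methods():
    cart = Cart()
    cart.add("apple", 2.0, qty=3)
    cart.add("bread", 1.5)
    assert cart.total() == 7.5
```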

Gregory Kapfhammer 00:04:42 Okay, thanks for that response. It was really helpful. Now just a moment ago you mentioned this idea of increasing code coverage. Can you talk briefly what is code coverage in the context of Cover Agent, and why is it a good idea to increase code coverage?

Itamar Friedman 00:04:57 Right, so code coverage, in the most simple terminology, considers a test suite and a set of files, methods, classes, or packages you want to cover, and you want to check how many of the lines, we’ll come back to this notion, how many of these lines are being covered by the test suite. So, just as an example, let’s say you have a test suite including five tests and you have two files, each one having two components, and each component has 100 lines. Okay? So 400 lines. The coverage report will tell you how many of these 400 lines are being covered by these five tests. It could also tell you a breakdown in most cases, depending on the package that you’re using to create the coverage report. In most cases it would tell you, for each method or each package or each file, the coverage of that specific component, that specific item.

Itamar Friedman 00:05:55 And then lastly, I want to say that we did talk about line coverage when I explained myself; I talked about how many lines are covered. That’s the basic metric here, but there are other types of coverage. For example, there could be branch coverage, there could be others, but just as an example, branch coverage means how well we’re covering each one of the branches in the code. For example, an if statement could have two branches, and in line coverage, if you pass through this if statement, great, plus one. But in branch coverage you want to check: if you passed through zero of its branches, then it’s 0%; if you passed one, then it’s 50%; and if you passed both options of this branch, then it’s 100%. By the way, an if statement could have many branches, not only one or two, and that’s really interesting.
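As a small illustration of the difference, consider this hypothetical function and tests (not from the episode): with only the first test, every line executes, so line coverage reports 100%, yet only one of the if statement’s two outcomes is exercised, so branch coverage is 50%.

```python
def apply_discount(price, is_member):
    """Hypothetical function used to illustrate line vs. branch coverage."""
    if is_member and price > 100:
        price = price * 0.9  # discount branch
    return price


# This single test executes every line (the condition is true), so line
# coverage reports 100%; but the if statement has two outcomes and only
# one is exercised, so branch coverage is 50%.
def test_member_discount():
    assert apply_discount(200, True) == 180


# Adding the "false" outcome brings branch coverage to 100%.
def test_no_discount_for_non_member():
    assert apply_discount(200, False) == 200
```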

Itamar Friedman 00:06:44 The reason I want to double click on branch coverage is because it was really hard to work with branch coverage, but actually LLMs open an opportunity to work with branch coverage, for two reasons. First, sometimes an LLM can help you describe branch coverage with a natural-language description; sometimes it’s hard to say which part of the branch, if it’s a large if statement, you are passing through and how. An LLM can help you with that natural-language description. And the second thing is that it’s usually really hard to generate the tests that pass through each one of the branches, and we’ll probably cover that a bit later; an LLM can help with that as well.

Gregory Kapfhammer 00:07:25 Thanks for that response. You mentioned things about LLMs, and we’re going to dive into how Cover Agent uses LLMs later in our discussion. You talked about code coverage, and you mentioned branch coverage and statement coverage. Both of those sound like higher-is-better metrics, where getting more code coverage is better than less. Is that the right way to think of it?

Itamar Friedman 00:07:46 Many would say that code coverage is a proxy metric. I think almost 100% of people would say that. But I think that also some percentage of developers and managers would say that code coverage is actually a vanity metric. Practically, what we’re trying to ask is: did we check that the code works exactly as expected? Okay, and now how do you check that? Supposedly you have a spec, and if the spec describes any possible usage of your code, basically you can try to convert this spec into tests, a comprehensive, one-to-one, fully complete translation of your spec into tests, and then who cares what the line coverage is, since you actually tested all the options. And I’m not only talking about 100% of the specification; I’m also talking about edge cases and bad cases, anything that could eventually happen to your software, any data processing, anything that can happen at runtime.

Itamar Friedman 00:08:55 So who cares about line coverage? The question is: can you write that kind of spec? Can you simulate that kind of situation? In most cases, even in the harshest settings, even close to the hardware where it’s maybe simpler, in the case of military or space where you really want to, it’s still very hard. And then the question is: what’s your alternative? So if you can say that you’ve covered 100% of the lines, then you’re probably in a good place proxy-wise, because you probably passed through all the options, or at least a lot of the options. Having said that, I just gave an example where maybe you passed through some if statement, and maybe this if statement leads you to the next line, but it could lead you there in many different ways, and you probably want to check each one of them if you want to perfect your testing. And then you want to go to different types of branch coverage and statement coverage, the things that you mentioned, but even then it could be a proxy and even vanity. But I think that to this day nobody has invented something better.

Itamar Friedman 00:10:03 So higher is better, because you probably tested more statements, more use cases, more flows of your software. Having said that, I think there are opportunities for LLMs to also change that, but that’s a different answer, so I’ll wait a second. And just to finalize, I want to say that there are other metrics, for example the mutation testing metric, but I think for most of them the industry standard would say they are complementary to code coverage.

Gregory Kapfhammer 00:10:33 Okay, that’s really helpful. So you’ve mentioned statement coverage and branch coverage and I think the intuition there is if you don’t cover the code, you’ll never know whether or not there was a bug in the code because you didn’t run it in the first place. Is that the right way to think about code coverage?

Itamar Friedman 00:10:50 Yeah, I agree. Basically, okay, maybe back to fundamentals. Code coverage in most cases is not being measured or tested or reported with some semantic analysis, symbolic execution, or something like that. It’s usually measured by actually running a test, actually running the code of the test, or it could be that running the test doesn’t involve code: if it’s integration testing, etc., it might be API calls or some script or some UI. But eventually it means running the test so that your software actually runs, and then, right, it’s a simple way to say it: if you checked a specific line, meaning you ran the software such that this line is being executed, you decrease the chance that that line has a bug. But you know, it doesn’t really necessarily say that. Let me give you an example.

Itamar Friedman 00:11:49 A test is usually composed of the data you want to run through your software, then possibly some logic for manipulating your software or manipulating the data, and then receiving the output of your software and checking that output. Basically, you can make a mistake and do really bad checks, and practically the line of the code that you wanted to check was executed. Hence the coverage report will say done, all clear, but you didn’t really test it. Let me give you an example. Let’s say I’m writing a test for a sort function: here’s an array, 1, 2, 4, 3, I’m running it through the sort and getting back the result, and I’m doing property-based testing, which is not bad, and I’m checking that the length of the output of the sort is four, right? And the sort function might bring me back 1, 2, 2, 2. It’s not accurate, yet the test coverage could be 100% or close to it, and we’re all happy, but no, the sort function doesn’t work. So the fact that we passed through a line is not enough. So although you described it really simply, I think it’s a bit more complicated, and that’s why coverage is also complicated and people sometimes say it’s a proxy or vanity.
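A minimal sketch of this sorting example, with a deliberately broken sort, shows how a weak assertion can leave coverage high while the bug slips through; the code is illustrative only and not from the episode.

```python
def broken_sort(values):
    """A deliberately buggy 'sort' used to illustrate the point above."""
    return [values[0]] * len(values) if values else []


# This test executes every line of broken_sort, so the coverage report is
# happy, but the assertion only checks the length property, so the bug
# goes unnoticed.
def test_sort_length_only():
    result = broken_sort([1, 2, 4, 3])
    assert len(result) == 4  # passes even though result is [1, 1, 1, 1]


# A stronger assertion actually catches the bug (this test fails on purpose).
def test_sort_order():
    assert broken_sort([1, 2, 4, 3]) == [1, 2, 3, 4]
```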

Gregory Kapfhammer 00:13:14 Yeah, that was a great response. So I think what you’re saying is that coverage is a proxy; it’s helpful but it can’t be the final answer. Now let’s turn a little bit to the Cover Agent implementation. So where does Cover Agent come into play in the story that you were just telling about testing a sorting algorithm?

Itamar Friedman 00:13:35 Great. So in many ways; I won’t double click on or explain all of them, and let’s chat. So first, you know, writing tests and thinking about happy paths and edge cases is a tedious task. If you want to get above 80% code coverage, even in a simple case, go, and I’m not saying that this is what you need to do, it’s not my call to action, but go to ChatGPT and ask it to give you some, you know, snake game or whatever, more than 100 lines of code but not much more than that, and then try to start generating tests that will, you know, cover more than 80%. It could be quite challenging. So it’s a tedious task to think about everything. So first, to use Cover Agent, you need to come up with a few tests yourself, and that will probably be quite easy, you know, just thinking about a few simple tests; it could even be a couple.

Itamar Friedman 00:14:29 And then the first thing that Cover Agent will help you with is generating a bunch of tests. If you give it a good prompt, and sometimes even without prompting it, there is a prompt option, it’s aiming to generate edge cases and complementary tests and help you think about your testing. Eventually, by the way, although a test suite is supposed to get you high code coverage, you can choose not to take all the tests, and you’ve still shortened your time to think about those tests and those edge cases, and then, okay, these two seem too simple, these two I don’t want, because you know there are test smells; there are code smells but there are also test smells, etc. By the way, a shameless plug: if you need to generate these first few tests and you’re either lazy or want to use AI for that, that’s one of our other tools, the CodiumAI IDE plugin for VS Code or JetBrains. The IDE plugin helps you exactly with these first few tests. That’s the one point I mentioned: it just helps you with more edge cases, etc.

Itamar Friedman 00:15:31 But ideally, Cover Agent would have the right frameworks and prompting and guardrails to also help you with quality. I just talked about test smells. So we are expecting AI, using LLMs, to kind of guide us not to fall into pitfalls, not to follow bad practices. Actually, I don’t know if you noticed, OpenAI just released CriticGPT, which is supposed to be specialized in that; we also have similar models at CodiumAI specializing in finding smells and problems, etc. And then if you are not, you know, the top testing engineer or developer, or you’re just tired and this mistake could cost you, then Cover Agent, which to be frank is still a tool that needs to be worked on, is supposed to also help you apply best practices. So these are two examples of why you would use Cover Agent.

Gregory Kapfhammer 00:16:31 Okay that was really helpful, thanks for telling us those details about Cover Agent. One of the things that I’m taking from our conversation so far is that you’re still expecting the software engineer to be in the loop so to speak when these test cases are being generated so they can guide the Cover Agent tool, they can accept some of the tests and reject other tests and then they can iteratively pursue the generation of test cases. Is that the right way to think about it?

Itamar Friedman 00:16:58 Yes. Actually, I would like to talk at a high level about examples where this tool works quite well, and I’m talking about quality with respect to whether it can, with a click of a button, generate a full test suite, a good one, and about the cases where it won’t. And actually, if you just expect it to give you a good test suite with a click of a button, it just won’t. Okay. And this debate will actually lead to a point where I believe that we are quite far, at least with the current implementation of Cover Agent, from not needing that developer in the loop, and it’ll stay that way for a long while. I think the profession will evolve, but that’s another question. So, great. I claim that for, say, four lines of clean code, one method, there is a very, very high chance that you’ll find these exact or very similar four lines of code somewhere in some open source, even if you think they’re special, and by the way, a plus or a minus might be a big difference, but very, very similar if not identical, even if we think we’re special.

Itamar Friedman 00:18:05 I think software becomes special in two ways. One is how we start combining these methods of four lines together, and sometimes just one minus or one plus or something like that makes a difference. Okay, why did I mention that? Practically, if you are doing small component testing and these components are relatively simple, they take simple strings, simple integers, you know, simple structures, and manipulate them in ways that you expect to find in open source, even if it’s a sophisticated type of manipulation, you’ll be surprised. Sometimes you will feel, oh my god, I just wrote two tests, I called Cover Agent, I got 100% code coverage. I actually did it three times with a different prompt and got three different test suites, and in the future we will also integrate mutation testing, etc., so it can even tell you which of the three test suites is better; we just talked about the coverage report not being good enough and needing to be complemented.

Itamar Friedman 00:19:02 And then, oh my God, you don’t need me, okay? But notice that I did say that you need to come with one or two first tests, and even in those cases, we can talk about it later if you want, in cases where your code is strongly coupled to data that is unique to your company, unique to, you know, your software. Let’s say you’re some kind of entertainment company and you have a special JSON, don’t hate me if that’s the structure I’m using now, that runs through your software. Let’s say you have some JSON that comes from a database and you keep throwing it into different parts of your software to actually execute the business logic you’re trying to achieve. If you don’t collect it yourself and use that to prompt Cover Agent, it has a very low chance of actually being able to do any proper testing.

Itamar Friedman 00:20:00 Now, by the way, I want to challenge myself: supposedly you can say that a lot of this JSON structure and data could be inferred from the code, if we’re talking about regression tests, and we are, we didn’t say it right away, but we’re talking about regression tests right now. Because there are if statements, there is different string matching, etc., so supposedly, if you’re an intelligent creature, the most intelligent creature, you can infer those structures, etc. But first, LLMs are not sufficiently intelligent right now, and I do claim that there will be many cases where you cannot infer everything from the code, and even if you try to, it might open up too large a space to actually do proper testing. So one reason, just as an example, why you can’t avoid working with the developers right now is that the developers need to provide additional information extracted from the databases, extracted from your Jira ticket, from your specification, and whatever, and prompt Cover Agent, and then leave the hard work, the tedious work of writing different edge cases according to that, to Cover Agent.

Itamar Friedman 00:21:07 And I claim that just collecting this data is not such a simple task. You need to know which data you want to collect and why, etc. So we’re still being challenged, but I think this is, in my opinion, the fun part of testing: thinking about what type of data I need, maybe because of what I did in my past as a professional. Thinking about what data I want to provide to Cover Agent so it can generate tests for me, I think that’s actually where the creativity and fun of generating tests lies.

Gregory Kapfhammer 00:21:36 That makes a lot of sense. So you’ve mentioned that these large language models are helping the tester to augment their existing test cases. Are the LLMs running on the tester’s computer or are you leveraging some type of cloud-based service that provides the LLM to the tester who’s using Cover Agent?

Itamar Friedman 00:21:56 Great, so we try to provide a tool that is generic with respect to the LLM. It’s hard; by the way, you will notice that different prompts could get better results, not just in coverage, from different models, but we try to be as generic as we can, and we connected the LiteLLM package, the open-source library, and that enables the user, the developer, if you download our tool from GitHub or install it, etc., to basically choose the model. Now, interestingly, LiteLLM right now supports almost any cloud model, but I actually think, and I’m a bit less confident here, that LiteLLM also supports local models, and if not, I believe that integrating Ollama, for example, is quite easy to do. So just open an issue for us on the Cover Agent repo, hey, I couldn’t run it locally, and probably quite quickly we will integrate Ollama or something. So any cloud model, although you might need to play and see what works better for you, and I think also local, and if not, we will make it local, it’s not an issue. Quality will be different, okay, this is life: smaller models at the moment are less good, and usually you run smaller models on your computer.
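For readers who have not used LiteLLM, the calls below sketch what model-agnostic usage looks like; the model names, the Ollama model, and the local base URL are examples that depend on your own accounts and setup, and Cover Agent’s actual integration may differ.

```python
from litellm import completion

messages = [{"role": "user", "content": "Write a pytest test for a sort function."}]

# Cloud-hosted model (requires the provider's API key in your environment).
cloud_reply = completion(model="gpt-4o", messages=messages)

# Local model served by Ollama; the model name and base URL are examples
# and depend on what you are running locally.
local_reply = completion(
    model="ollama/llama3",
    messages=messages,
    api_base="http://localhost:11434",
)

print(cloud_reply.choices[0].message.content)
```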

Gregory Kapfhammer 00:23:16 So when I tried the tool called Cover Agent I was able to integrate with a whole bunch of different cloud-based LLMs, and we did use LiteLLM when I was trying out the tool, and we’ll talk about that later in our episode. Just to be very clear, when I’m doing code coverage testing with Cover Agent, is it transmitting my code to the LLM that I have picked using the LiteLLM system, and if so, what kind of code does Cover Agent transmit to the cloud?

Itamar Friedman 00:23:48 Yeah, totally. In order to provide good context for the LLM, what is given is the existing test suite, the coverage report to some extent, and also the code we want to cover; with these, the LLM will provide much better results. One of the main operations for Cover Agent is to generate tests that cover lines that are not yet covered in order to increase the code coverage. Therefore, the code needs to be sent to the LLM provider so it can see those lines and as much of the context around them as possible, to actually be helpful, in most cases, in passing through a given branch. So definitely the code and the existing tests are beneficial. That’s what Cover Agent sends, as well as the coverage report or data that is extracted from the coverage report.

Gregory Kapfhammer 00:24:37 Alright, so the LiteLLM tool helps Cover Agent interact with a whole bunch of different cloud-based models, right? You can also use it with some local models, and it’s going to transmit a part of the existing test suite and maybe the project source code so that it can generate new test cases that augment the test suite. Now I think at this point we’ve mentioned that there are a whole bunch of cloud LLMs like ChatGPT or Claude or GitHub Copilot. Before we move into some of the specifics about the implementation of Cover Agent and how you actually evaluated it, can you briefly compare it to things like Anthropic’s Claude or GitHub Copilot or other tools of that nature?

Itamar Friedman 00:25:21 Yeah, of course. Okay, so with your permission I’ll really briefly talk about AlphaCodium. It’s another open-source tool that we provided that can, with a click of a button, generate solutions for code contests, like competitive programming competitions, and do better than the majority of the competitors there. And that’s amazing, and if you just try GPT-4 or Sonnet 3.5, etc., and prompt them, even with the best prompt you can, you won’t reach even half of the accuracy that you would reach with AlphaCodium. And the reason is that AlphaCodium is more agentic, or what we call a flow; it’s a specific agentic methodology that is designed to generate solutions for competitive programming problems, and within this flow we try to inject the best practices that you would follow as a developer. So it has built-in guardrails and best practices that you won’t see, out of the box, ChatGPT or Claude actually applying. That’s why you get the best results there. Now apply the same notion to Cover Agent: we tried to design a flow, as Meta actually did right in their paper, tailored to this task.

Gregory Kapfhammer 00:26:48 Is the idea then that somehow Cover Agent is collecting information from the project’s code base and then making the right kinds of prompts so that I as a software engineer don’t have to figure out how to do that when I’m chatting with GitHub Copilot or Claude or some other system like that?

Itamar Friedman 00:27:06 Yeah, right. There’s a slight, still, as an engineer there’s something inaccurate there and I’m going to fix it, but it’s correct. So before I fix it, you’re right: you are going to use Cover Agent by stating, here are the components I want to cover, here’s how I would run the test suite and the coverage report, and, you know, there are a few more parameters, like what coverage I want to achieve, and then Cover Agent would definitely do some context collection that basically would not happen easily with those tools that we just mentioned. But also, the second part is that it’ll start a flow that will not happen there; maybe by chance it’ll happen with Claude, there is some agentic flow, but there are higher chances with a flow that’s designed to increase code coverage and generate tests, and Cover Agent will do that for you. The only inaccuracy is that the context collection from your repo right now in Cover Agent is relatively basic, and that’s one of the, you know, opportunities to improve. It’s much more elaborate than no context collection, or than the relevant collection in ChatGPT or Claude, but it’s still basic as opposed to other tools of ours, like the PR-Agent.

Gregory Kapfhammer 00:28:22 Before we move on to the next topic, can you give a concrete example of the flow that you had been referencing a moment ago?

Itamar Friedman 00:28:29 The AlphaCodium one or the Cover Agent one?

Gregory Kapfhammer 00:28:32 You could describe it in either Cover Agent or AlphaCodium because I know they’re both tools to help software engineers by using LLMs. But let’s start with Cover Agent. What is the flow that you were referencing?

Itamar Friedman 00:28:44 Okay, so when we came out with AlphaCodium we coined a term, “flow engineering,” which was actually adopted by the industry, if I may say, and by Andrej Karpathy, etc. What we’re saying there is that usually how we perceive an AI-empowered system is what I call system-one thinking, system-one systems; I’m referring to Daniel Kahneman, unfortunately no longer with us, he passed away eight months ago or so. It’s basically like you give the system a prompt, like a request, the system collects some context, immediate context, and with one inference it provides the output, and that’s what prompt engineering is about; even RAG is almost that, there’s just a more sophisticated context-collection part. In system-two thinking, the system is not just one inference; it’s running a full flow that could have anywhere from two to an infinite number of prompts, usually a dozen or more, and it will also use some tools.

Itamar Friedman 00:29:49 Okay, and this flow is not just chain of thought or ReAct, if you’re familiar with those, where you just give the LLM full freedom to compute, to infer by itself what the next step is. Rather, it’s almost a state machine. That’s why I’m calling it a flow, with edges and nodes, and at each node you say what task needs to be done. And flow engineering is designing that flow. And then, interestingly, it happens that the variance of the accuracy of the system due to changes in the prompt actually reduces. When you’re doing only system-one thinking and not system two, where system two is the fully elaborated flow, then small changes in prompts can generate really big differences in the output, and you have probably experienced that; but if you’re doing flow engineering, a lot of the information and intelligence is in the flow itself and usually the prompts are simpler, etc.

Itamar Friedman 00:30:46 In AlphaCodium, to put it simply, I’m not going to elaborate too much right now, we basically designed a flow that imitates the developer. Here’s a problem given as an input. As a developer you would reflect on it, try to understand it, try to think about how you would test the solution. Then think about a few solutions, try to think which is the first one you want to try to implement, think about how you would test that specific one, then implement it, then check it against the tests and fix it, generate a few more tests, fix against those, and then output. That’s exactly, by the way, the AlphaCodium flow, described in a very simple way. The Cover Agent flow is very different. To put it really simply: here are a few tests; we parse the coverage report, try to infer which lines you want to cover now to enhance the coverage, and try to generate a few tests. For each test, check: does it run? Does it pass, and does it increase coverage? Yes? Take the test, rerun the coverage report, and iterate. Okay, that’s the full flow, put in a simple way.
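The loop below is a rough sketch of the Cover Agent flow as described here, not the project’s actual code; the `generate_tests` and `measure_coverage` callables stand in for the LLM call and the coverage-report parser.

```python
import subprocess
from pathlib import Path

def coverage_loop(generate_tests, measure_coverage, test_file,
                  desired_coverage=0.9, max_iterations=5):
    """Sketch of the iterative flow described above; the two callables are
    stand-ins for the LLM call and the coverage-report parser."""
    test_path = Path(test_file)
    for _ in range(max_iterations):
        current = measure_coverage()              # e.g., total line coverage as a float
        if current >= desired_coverage:           # stop once the threshold is met
            break
        for candidate in generate_tests(current): # ask the LLM for new candidate tests
            original = test_path.read_text()
            test_path.write_text(original + "\n\n" + candidate)
            result = subprocess.run(["pytest", "-q"], capture_output=True)
            improved = measure_coverage()
            # Keep the candidate only if the suite still passes and coverage went up.
            if result.returncode != 0 or improved <= current:
                test_path.write_text(original)    # revert: drop the candidate
            else:
                current = improved
```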

Gregory Kapfhammer 00:31:54 Wow, that was really helpful. That makes a lot of sense and I think it is a good foundation for us to dive into more details. Now, we’ve already mentioned LiteLLM, and I think we get that it helps the Cover Agent tool work with a hundred-plus large language models. I wanted to briefly talk about the tool that’s called Pydantic, because I noticed that Cover Agent uses Pydantic and I think that Pydantic may play a role in flow engineering. So can you tell us briefly what is Pydantic and how does it work in the Cover Agent tool?

Itamar Friedman 00:32:27 Yeah, in the most simple terms I think Pydantic helps you manage typing. Python is not inherently very good with types, as opposed to some other languages, and sometimes it’s extremely helpful that when you pass data to a different method, function, etc., you ensure that the parameter, the variable, etc., receives the input with a certain style, in a certain format, aka a certain type. Pydantic helps you manage that in different ways. And you’re right that if we designed a prompt to fit a specific structure, and then you ruin it in the middle, maybe the entire flow would be ruined. Rather, if you stop the flow because of a typing problem, or you fix the type along the way, in many cases you probably still get a good result. So that’s definitely a kind of guardrail, an example of a tool that you can integrate into flows to guardrail the flow.
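Here is a small illustration of the kind of guardrail being described, assuming a hypothetical structure for one LLM-suggested test; the field names are invented and are not Cover Agent’s actual schema.

```python
from pydantic import BaseModel, ValidationError

class GeneratedTest(BaseModel):
    """Hypothetical shape for one LLM-suggested test inside a flow."""
    name: str
    test_code: str
    behavior: str  # e.g., "happy path" or "edge case"

raw = {
    "name": "test_sort_empty",
    "test_code": "def test_sort_empty(): ...",
    "behavior": "edge case",
}

try:
    candidate = GeneratedTest(**raw)   # field presence and types are checked here
except ValidationError as err:
    # Stop (or repair) the flow instead of letting a malformed
    # structure ruin the later steps.
    print(f"Dropping malformed LLM output: {err}")
```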

Gregory Kapfhammer 00:33:34 That makes a lot of sense. And just to make sure that listeners are clear, the Cover Agent system itself is built in Python and many of the packages that we’re talking about are connected to the Python programming language, although Pydantic is actually now at its core implemented in Rust. I think at this point it might be helpful if we very quickly comment on the types of programs for which Cover Agent can generate tests. If Cover Agent is implemented in Python, does it only work for Python or can it generate code for programs in a wide variety of other languages?

Itamar Friedman 00:34:10 Great, so this time I’m going to simplify my answer and then I’m going to correct myself, just to keep it simple at the beginning. So Cover Agent itself is written in Python, but the LLM is the one that generates the code, and the command that you provide as an input to run that test, to run your environment, to run your code, is the one that actually needs to be in your language, etc. So there’s no theoretical constraint about which programming language you want to apply Cover Agent to, except maybe if the language is so exotic that the LLMs were not trained on it. But I mentioned we did design a flow, and it includes different prompting, and certain prompts try to deal with different flavors of testing types and testing properties; for example, in Python you need to have this indentation, right?

Itamar Friedman 00:35:10 And in other languages there are others; you know, you need to have a setup method in some cases, of different kinds, and Go is specific about its style, etc., and sometimes the LLM will not output proper tests, and the entire idea of Cover Agent is automatically generating those tests, running them, checking them. So you need the LLM to output proper tests to start with, and then you need to have the prompt that guides the LLM with different properties, and it could be that different languages would have higher accuracy, fewer faulty tests, etc. For example, we mostly checked Go and Python, and TypeScript and JavaScript will probably do really well there too. I think the next language we tried, Java, did really well. Last thing on that end: you do need the coverage report, and part of Cover Agent is parsing the coverage report.

Itamar Friedman 00:36:03 Okay? So that’s also somewhat language-specific, the coverage report, or you know, LCOV, JaCoCo, or sorry, that’s a funny name, sometimes I don’t say it correctly. And so basically that’s another limitation. It’s not a real problem. If you don’t see one of the coverage tools there, you can go to the open source and try to write a parser. So it’s not really a limitation, rather a current status of the implementation. I think we’re covering LCOV and JaCoCo, except there are three or four of them and maybe one is missing. Okay, so that’s another limitation that can be overcome really easily.

Gregory Kapfhammer 00:36:40 That makes a lot of sense, and in fact it connects to my experience with using Cover Agent. So when I was using it to test some of my own Python code, I gave it a coverage report in the Cobertura XML format, and lo and behold, it parsed it perfectly, and it was able to generate test cases for my project that did in fact increase coverage. So the idea is, if I’m understanding you correctly, Cover Agent itself doesn’t have a coverage tool inside of it. It leverages existing coverage tools that can produce coverage reports in existing well-known formats like the ones that come from Cobertura or JaCoCo or other tools of that nature. Am I thinking about it the right way?

Itamar Friedman 00:37:23 Yes, correct. Cobertura is exactly one of our favorites, you know, for example for Python, and JaCoCo; those are definitely the leading ones, but you might have your own preference that you’re using in the company, and you’re probably already running them, for example, as part of your CI/CD, part of your build, when you want to check that your build is currently passing all tests or whatever process you have. We definitely didn’t want to create our own coverage tool and coverage report. Rather, we wanted to connect to yours, because if you care about testing, and that’s why you’re coming to Cover Agent, then you’re probably already using one of these tools, and we prefer to connect to them so you won’t need to change anything and this tool will be useful for you.
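For reference, a Cobertura XML report can be read with nothing more than the standard library; this is a minimal sketch of that idea, not Cover Agent’s own parser.

```python
import xml.etree.ElementTree as ET

def read_cobertura_line_rate(report_path):
    """Read the overall line coverage from a Cobertura XML report.
    The 'line-rate' attribute on the root element is a fraction, e.g. 0.85."""
    root = ET.parse(report_path).getroot()
    return float(root.attrib["line-rate"])

def uncovered_lines(report_path):
    """Yield (filename, line number) pairs for lines with zero hits."""
    root = ET.parse(report_path).getroot()
    for cls in root.iter("class"):
        for line in cls.iter("line"):
            if int(line.attrib.get("hits", "0")) == 0:
                yield cls.attrib["filename"], int(line.attrib["number"])

# Example usage (assuming a report produced by coverage.py's `coverage xml`):
# print(read_cobertura_line_rate("coverage.xml"))
```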

Gregory Kapfhammer 00:38:17 Sure, that makes a lot of sense. Now, a moment ago I remember you talking about how it could generate code for a Python project or maybe a Go project and in the context of Python, the LLM might actually generate a test case that isn’t formatted correctly or in Go it might generate a Go test that doesn’t compile. So what does Cover Agent do in those tricky situations?

Itamar Friedman 00:38:41 Yeah, great. So the first thing that Cover Agent does in order to accept, to validate, a test is actually check if it runs, compiles, etc. And the most basic implementation of Cover Agent would just drop that test. A bit more sophisticated implementation of Cover Agent would take the output from the log, from the error message, and try to regenerate and fix that test. Why am I saying that with such confidence? Because we do have this fix mechanism in our other tool, Codiumate, but in Cover Agent, maybe by the time you publish this it will already be there; first, though, we wanted to imitate and reproduce the results as much as we could from the original paper, and that doesn’t include that mechanism. Right now, if it doesn’t build, run, compile, we don’t take it. But obviously the more comprehensive solution would be to automatically fetch the error messages and try to fix the test. You could claim that maybe it’s not the best way to go, because who said that, runtime-wise and cost-wise, it’s eventually worthwhile to look at the error, but in my opinion it is at least one of the routes to explore, because if it didn’t run, if it didn’t compile, maybe there is actually some gold to dig there, to fix that and reach a new branch, a new line to cover, etc.
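A rough sketch of that accept-or-drop step, under the assumption of a pytest project, might look like the following; capturing the error output is what a future “fix the test” prompt could build on, and none of this is Cover Agent’s actual code.

```python
import subprocess

def run_candidate(test_file):
    """Run a single candidate test file and return (accepted, error_output)."""
    result = subprocess.run(
        ["pytest", "-q", test_file],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True, ""
    # The simplest policy, as described above, is to drop the test. A more
    # comprehensive flow could instead feed the error output back to the
    # LLM in a hypothetical "fix this test" prompt.
    return False, result.stdout + result.stderr

# "test_candidate.py" is an example file name for an LLM-generated test.
accepted, errors = run_candidate("test_candidate.py")
if not accepted:
    print("Candidate rejected; error log that a fix step could use:\n", errors)
```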

Gregory Kapfhammer 00:40:21 Yeah, that’s a really good point and it connects to something that I was thinking about. So in addition to a test case not compiling or perhaps not being formatted correctly in the context of Python and pytest, it may be that the test case is failing. So does Cover Agent distinguish between a test case that is failing because it found a bug and a test case that is failing because it wasn’t set up correctly or something like that?

Itamar Friedman 00:40:47 Yeah. Okay. So right now Cover Agent, similar to the paper that, you know, we tried to reproduce, is only doing regression tests, and this is very valuable. For example, let’s assume that you are an Agile team developing a lot of code and you don’t have time to create regression tests. At some point there are usually builds that you have very high confidence in. Much more quickly than you would imagine, you can generate a lot of regression tests, and then moving forward you would know if you’re breaking something or not. So regression tests, although their value is limited, in certain cases can be invaluable. Now, considering that we’re focusing on regression tests, if a test passes, we do want it; that’s the short answer. Having said that, it’s tricky. A test can actually pass when the logic is bad and the code is bad, and a test can pass when the logic of the test is bad,

Itamar Friedman 00:41:49 because actually the assert we talked about in the beginning, the thing that checks whether the result is right, is bad. So it’s very tricky, and different technology can be applied to try to improve on that; there are whole fields and areas here. It’s not just two by two, bad code, good code, bad test, good test, and things like that. It opens up a whole new world of what’s a bad test, a bad assert, bad logic, bad data. So there can be a lot of problems there. But the simplest implementation right now is regression tests, and considering that we did proper prompting, if it passed, we want it. By the way, that’s also one of the reasons you want a human reviewer at the end, doing what we talked about in the beginning.

Gregory Kapfhammer 00:42:36 Wow, that’s a really good response and I see exactly what you’re getting at. And clearly generating test cases or augmenting a test suite automatically is a really tricky task to get exactly right because sometimes test cases may be passing but for coincidental reasons and in other cases they may be failing. But if you’re going to focus on regression testing, ideally you want to confirm the existing behavior and you want to increase coverage when possible. Is that the right idea?

Itamar Friedman 00:43:07 Yes.

Gregory Kapfhammer 00:43:08 Okay, good. Now one of the things that I think about when I’m writing my own test cases is that sometimes I have to write mock objects. Does Cover Agent actually generate these various types of mocks or spies? Overall, can Cover Agent make test scaffolding for me?

Itamar Friedman 00:43:25 I would say that mocks are one of the most challenging parts. I think that’s also what you hinted at with proper testing, because you want tests to be fast and not flaky and things like that. But there’s also a problem with mocks: maybe it’s not proper testing of your real production case. It’s really, really challenging. Okay, that’s a side note on why I think it is important. Right now, specifically for Cover Agent, and I’m not talking about other agents, mocks are treated very simply. If you provided mocks as part of your initial test suite, they would be considered. Otherwise, in most cases, and I’m not saying always, it’s still AI, a statistical creature, it might generate mocks although it’s not being prompted to, but in most cases, if you didn’t provide them as part of the original small test suite that you’re trying to enhance, then you need to provide a proper prompt.

Itamar Friedman 00:44:22 Although Cover Agent is a tool and a flow and not, you know, a Q&A, there are still mechanisms, actually two mechanisms, to prompt Cover Agent. One is by providing additional data in files, etc. We talked about part of the reasoning for that: for example, if you want to provide data that is expected because your software is complicated and the input is coming from some database or something that the LLM won’t be able to think of. And the other is more traditional, it’s weird to say traditional, but more traditional prompting, and we managed to inject that into the flow and not a Q&A. So you can use these two mechanisms to try to influence Cover Agent to create mocks, because the default is kind of off, and it might do a good job, relatively speaking; we need to say compared to what: compared to tools that specialize in generating mocks, maybe not as good, but compared to a simple ChatGPT or something like that, yes.

Gregory Kapfhammer 00:45:21 Before we move on to the categorization of tests that Cover Agent may generate, I remember that a moment ago you talked about something called a flaky test. So briefly, can you tell the listeners what a flaky test is and then how Cover Agent handles flaky test cases that may be returned by the large language model?

Itamar Friedman 00:45:41 This time I’m going to give an example to explain what a flaky test is. If you have a test and you’re running it five times and it doesn’t give you the same result each time, especially if sometimes it passes and sometimes it fails, that’s the most obvious flakiness, then it’s flaky. It means: how can you trust that test? If the same test, you know, is run five times and passes four times and fails once, that’s a flaky test. With LLMs, you can try to prompt them not to generate flaky tests. I think the prompting will try to do that to start with, because there are test smells that can be detected, for example an obvious API call. If the API fails, let’s take the simplest example again: here you’re calling an exchange-rate API, from dollars to euros or something like that, and that could be flaky, so you’re making your test flaky for no reason, and that’s an ideal place, in my opinion, an exceptional reason, to put in a mock, for example. So the LLM will try to help you with that if you prompt it.

Itamar Friedman 00:46:51 But there’s still a lot of responsibility on the developer to run Cover Agent on components, or in a way, that starts with checking independent components or independent microservices, etc., to reduce the flakiness. Again, yet another responsibility for the developer.
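To tie the flaky-test example and the mock discussion together, here is a minimal, hypothetical pytest sketch; the conversion function and the 0.92 rate are invented for illustration.

```python
from unittest.mock import Mock

def convert(amount_usd, get_rate):
    """Hypothetical function under test: converts dollars to euros using a
    rate provider that would normally hit a live exchange-rate API."""
    return round(amount_usd * get_rate(), 2)

# Flaky version (shown only as a comment): asserting against a live API
# means the expected value changes over time, so the test passes some days
# and fails on others.
#
#   def test_convert_flaky():
#       assert convert(100, fetch_live_rate) == 92.0
#
# Deterministic version: replace the rate provider with a mock, so running
# the test five times gives the same result five times.
def test_convert_with_mocked_rate():
    fake_rate = Mock(return_value=0.92)
    assert convert(100, fake_rate) == 92.0
```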

Gregory Kapfhammer 00:47:13 Okay, so it sounds like down the road Cover Agent might be able to do things like rerun the test suite or look at deltas of coverage or things of that nature, but in part it still is the responsibility of the software engineer to think carefully about which components they ask Cover Agent to test and then to look at the test cases and to assess whether they’re good ones or not.

Itamar Friedman 00:47:36 Correct. I’m not saying that in five years or so there won’t be systems that will take on more and more parts, for example, a full code analysis of your code base and then suggesting to you as a developer: here are five areas, five components and two microservices and one end-to-end system, and we recommend you run tools A, B, C, D; one of them is Cover Agent, and then you’re almost, you know, an orchestrator, etc. And that will come with an analysis of which parts of your code base would probably be less flaky and also very important to test, etc. But right now, I think, although at CodiumAI we are developing things like that, they’re not as mature, and developers are needed. Again, I just hinted that even in five years they’ll be needed, but right now they’re needed for different, more fundamental reasons, the stuff you just mentioned.

Gregory Kapfhammer 00:48:29 Of course what you’re saying definitely makes sense and it connects to my experiences with Codium and also with Cover Agent. One of the things that I wanted to briefly touch on is that when Cover Agent was creating tests for me automatically, I noticed that it would categorize the tests as a happy path test, or an exception test and I didn’t do that. The tool did it for me automatically. So what are those types of tests and how does the tool actually categorize them in that way?

Itamar Friedman 00:48:59 Okay, think about it: we talked about code coverage, right? We talked about the fact that if you cover more lines of your code, that is, you ran tests that executed a certain line, then you probably have a lower chance that you have a bug in that line. That means that if you want to cover 100% of the lines, just as an example, and we talked about the problems with that, you need to cover many different behaviors of your code, many different edge cases, etc. And part of the flow engineering and the prompt engineering we did in Cover Agent, and that’s only going to be enhanced as we progress through time, is techniques that try to generate, you know, these happy paths, edge cases, etc., to cover more and more parts, more lines, etc.

Itamar Friedman 00:50:00 And I think developers appreciate it when the thinking of an agent is exposed, when you can notice those properties, and because we said that, for example, the developers need to review the output, it helps you review it more easily. By the way, when I’m talking about flow engineering, for example, in order to catch not only more edge cases but strong edge cases, I might want to run mutation testing and other tools. Okay? So Cover Agent will output some of these categories, so you can be amazed and engaged, and it makes it easier for you to review the output.
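The labels Cover Agent attaches are descriptive, and the pytest sketch below, with an invented `parse_quantity` function, shows the kinds of tests those categories typically refer to.

```python
import pytest

def parse_quantity(text):
    """Hypothetical function under test."""
    value = int(text)
    if value < 0:
        raise ValueError("quantity must be non-negative")
    return value

# Happy path: typical, well-formed input.
def test_parse_quantity_happy_path():
    assert parse_quantity("3") == 3

# Edge case: boundary value.
def test_parse_quantity_zero():
    assert parse_quantity("0") == 0

# Exception test: invalid input should raise, and the test asserts that.
def test_parse_quantity_negative_raises():
    with pytest.raises(ValueError):
        parse_quantity("-1")
```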

Gregory Kapfhammer 00:50:40 So I think this is the second or third time that you’ve mentioned mutation testing. My impression is that it’s not yet integrated into Cover Agent, right? But to make sure our episode is self-contained, can you briefly talk about what it is and then what steps you might take to integrate mutation testing into Cover Agent?

Itamar Friedman 00:50:58 Yeah, by the way, I’d love to do that, and maybe that’s a good call to action. We open-sourced Cover Agent and you can use it and we’ll be happy, and we do have a lot of contributions, by the way, from the community, individual developers as well as companies, etc. I want to encourage everyone here to contribute, and I think, especially for professionals that have, you know, worked with mutation testing, that’s one of those contributions we’d love to see; we provided a roadmap. There are different types of mutation testing. I am going to simplify a bit; also, it’s hard for me not to be accurate, but I probably wasn’t accurate in many places in order to simplify things, and I’m going to do that again here. Roughly speaking, there are two types of mutations. Given a test suite and given code you want to test, type one mutates the tests and type two mutates the code.

Itamar Friedman 00:51:50 Let’s talk about type two. Let’s say I have a test suite that supposedly covers 85% of the lines, and let’s say I have a second test suite that also covers 80% of the lines. How do I know which test suite is better? It could be that they’re even covering the same lines, but they’re very different tests; they actually could be different tests. Mutation testing of type two, the one that mutates the code, could help. Let’s start with something simple: randomly mutate the code and see whether the test suite fails. Let’s say we start with all the tests passing, and the test suite does not fail any test even as we mutate a lot; that test suite is probably not strong. The test suite that fails more tests as I mutate the code is probably stronger, because it’s catching more of the code. Notice a few hyperparameters here: do I randomly mutate the code, how many mutations am I doing, which type of mutation? These are part of the things that in the past we would do heuristically with some algorithms, and actually LLMs can help us improve performance on that part. I believe you can imagine what type one is; it’s different, here you mutate the tests. I’ll stop here; let me know if you want me to dive in, to double-click on that.
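A tiny, self-contained sketch of the type-two idea, with invented functions, shows how mutating the code separates a weak suite from a strong one even when their line coverage is identical.

```python
# Original code and a "type two" mutant: the '+' operator flipped to '-'.
def add(a, b):
    return a + b

def add_mutant(a, b):
    return a - b

# Two test suites that both execute every line of add(), i.e. the same
# line coverage, but with different strength.
weak_suite = [
    lambda f: f(0, 0) == 0,        # passes for the mutant too: 0 - 0 == 0
]
strong_suite = [
    lambda f: f(0, 0) == 0,
    lambda f: f(2, 3) == 5,        # the mutant returns -1, so this catches it
]

def killed(suite, mutant):
    """A mutant is 'killed' if at least one test in the suite fails on it."""
    return any(not test(mutant) for test in suite)

print(killed(weak_suite, add_mutant))    # False: the mutation survives
print(killed(strong_suite, add_mutant))  # True: the stronger suite catches it
```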

Gregory Kapfhammer 00:53:10 Okay, thanks for that description of mutation testing. I’m sure that will be helpful to the listeners as we start to draw our episode to a conclusion. I want to let them know that they should check the show notes, because in the show notes we’ll reference the Foundations of Software Engineering paper that was published by researchers and engineers at Meta and that was ultimately the inspiration for building the Cover Agent tool. What I want to do now is briefly turn our attention to some of the lessons that we’ve learned today. So for example, we’ve talked about automated test augmentation. We’ve talked about the role that the large language model should play in this process. And I’m wondering, can you share a little bit about some of the lessons that you have learned while building Cover Agent or the AlphaCodium system or some of the other commercial products that are coming out of CodiumAI? What have you learned about developer productivity and building good developer tools?

Itamar Friedman 00:54:07 Right. So I’ll briefly relate one learning about building those tools and one learning about using developer productivity tools. So, about developing them: it could be a talk by itself, but I do want to reiterate the flow engineering point. Don’t try to build a tool where all it does is take a prompt, collect context, prompt an LLM, and output. Try to build a full system that involves an algorithm, a flow, that includes guardrails and injects best practices that fit what you are building, okay? And that creates higher accuracy and better results, as well as lower variance, either of the system or with respect to the prompts that you’re injecting. And by the way, I think this also carries over to any other tool that you’re building with AI. Okay? Try to think about flows when you’re building a tool for physicians or for lawyers, etc.

Itamar Friedman 00:55:11 About dev tools: I think developers love writing code, they love building, and now they can, you know, use code completion to do even a bit more, probably save time and go drink more coffee. I don’t think it’s really helpful for generating more features. But actually, let’s think about the real bottlenecks. I don’t think it’s writing more lines of code. It’s about code quality, it’s about code reviewing, it’s about testing, etc. So I think that when you want to start integrating, you know, different AI dev tools to improve your productivity, think about what your bottlenecks are today and what your problems are today. For example, if you look at your sprint and you see 50% of your tasks are bug solving, do you really think that adding code completion will help you deal with that? Probably you need to find out how AI tools can be injected into your workflow to reduce the number of bugs, to improve the quality, etc. So that’s my main suggestion: don’t think about the tools, think about your pains and problems, and then search for the existing tools to try to solve them.

Gregory Kapfhammer 00:56:21 Wow. This has been an awesome conversation, and what you said a moment ago was really thought provoking. If listeners want to learn more about some of the things that we’ve talked about today, like GitHub Copilot, they can refer to Episode 533 of Software Engineering Radio. I’ll also link to some other episodes where we’ve covered some affiliated concepts. But with those points in mind, now that we’ve talked about AlphaCodium and Cover Agent and you’ve given us a call to action, is there anything that we haven’t yet covered that you’d like to share?

Itamar Friedman 00:56:54 I just want to say that I really enjoyed this very detailed discussion about Cover Agent, and what I want to encourage people to do is play with these kinds of tools. Play with them; the profession, I’m not saying it’s going to evolve tomorrow, but it’s going to evolve, so try downloading these tools and playing with them. Cover Agent is literally one minute to run, and you’ll notice that the one part that is hard to navigate is the command that you need to provide to Cover Agent for how you want your tests to run in your environment, because the idea is that Cover Agent needs to run, you know, in the environment where the component runs, and then it might take you 50 more minutes. Okay, you probably have that environment to some extent, and then within an hour you’re done. Don’t stop yourself. Try these kinds of tools, Cover Agent, and do that, in my opinion, weekly, because some of them are suddenly going to have a leapfrog effect and you want to have those in your arsenal as a developer. Everyone, all of us developers, I think, are going to be AI-empowered developers in the very near future.

Gregory Kapfhammer 00:58:02 Yeah, that’s a great point. I want to emphasize for our listeners that you can download the Cover Agent tool from GitHub, and I tried it out and, Itamar, I have to agree. It was super fun. I tried it with a whole bunch of different LLMs, I tried it on different projects, I tried it for different languages, and it was super cool to see it generating all of those test cases for me, and in some cases I could integrate them into my existing test suite. So let me say thank you for this conversation. And additionally, let me say thank you for open-sourcing the Cover Agent tool. It’s been a lot of fun for me to try it and it’s been great for me to chat with you today. Is there anything else you’d like to pass along to our listeners before we conclude the episode?

Itamar Friedman 00:58:45 I don’t think so. Thank you so much.

Gregory Kapfhammer 00:58:46 Well thank you for talking today. This is Gregory Kapfhammer signing off for Software Engineering Radio. Goodbye everyone.

Itamar Friedman 00:58:53 Thank you very much.

[End of Audio]
