
SE Radio 693: Mark Williamson on AI-Assisted Debugging

Mark Williamson, CTO of Undo, joins host Priyanka Raghavan to discuss AI-assisted debugging. The conversation is structured around three main objectives:

  • understanding how AI can serve as a debugging assistant;
  • examining AI-powered debugging tools;
  • exploring whether AI debuggers can independently find and fix bugs.

Mark highlights how AI can support debugging with its ability to analyze vast amounts of data, narrow down issues, and even generate tests. From there, the discussion turns to AI debugging tools, with a particular look at ChatDBG’s strengths and limitations and a peek at time travel debugging. In the final segment, they consider several real-world scenarios and evaluate the feasibility and practicality of AI acting autonomously in debugging.

Brought to you by IEEE Computer Society and IEEE Software magazine.



Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Priyanka Raghavan 00:00:19 Hi everyone, this is Priyanka Raghavan for Software Engineering Radio, and today we’ll discuss the topic Use of AI for Debugging. We’ll look at three aspects in this show: one is going to be about using AI as an assistant to debug; two, AI debugging tools; and three, is it possible that an AI, given a bug, can fix it autonomously? For this we have Mark Williamson as a guest. Mark is the CTO of Undo. He’s also a specialist in kernel-level, low-level Linux embedded development with wide experience in cross-disciplinary engineering. He programs a lot in C and C++, and one of his proudest achievements, from the Undo website, is his quest towards an all-green test suite. So Mark, welcome to the show. Is there anything else in your bio that you would like to add apart from what I’ve just introduced you as?

Mark Williamson 00:01:18 I think that’s a pretty good summary. I guess in my time at Undo most of my last 11 years has been a quest to get people to appreciate debuggers more and I’m glad to be here talking about them. They’re one of my favorite subjects.

Priyanka Raghavan 00:01:30 Great. So we’ll kick off the show by asking you to define debugging in a professional software engineering context and how does it differ from simply fixing bugs?

Mark Williamson 00:01:42 Thanks. I really like this question because I think it’s often misunderstood. Developers spend most of their time at the computer debugging. It’s easy to have a view that bug reports come in from the field, from customers. They go into a GitHub issue tracker or something like that, they get taken out, and a developer fixes the bug. But I would say debugging is the quest to understand what your program is doing and why it’s not doing what you expected, and that starts the instant you’ve typed in your first code. So I would say that most of development is itself debugging. I’ve seen a lot of stats recently that only about 30% of developer work is programming, and therefore coding agents aren’t solving the whole end-to-end problem. But I would say that probably 80% of that 30% is debugging, not typing in the code. Code generation is a very small part of what developers do, and a lot of the technical work is this debugging process of answering questions and gaining understanding.

Priyanka Raghavan 00:02:55 One of the things I wanted to ask you is what do you use debugging for? If a program does not function the way it is supposed to, that’s a runtime issue, and that would be something you would debug. But how about a case when it’s not performing very well? Is that also a case where you would use debugging?

Mark Williamson 00:03:18 I would say yes. I think different developers might call this a different process, but I would say debugging is any time you are trying to answer what happened or why did that happen, and that includes performance issues, but you have to then broaden your understanding of what a debugging tool is. So I would say there are lots of tools you can use. You can add printfs into your code. There are often logging frameworks. There are also system-level utilities like strace, GDB, Valgrind, and perf, and even version control, to go back and figure out when a regression came in. I would say performance is in that continuum. So you might use a performance profiler, but actually, why did the control flow bring you to the hot path? Well, that’s maybe logging or a debugger, and it’s also a question of using each tool’s output to figure out what the next tool you apply should be.

Priyanka Raghavan 00:04:16 That’s interesting. This brings me back to one of the episodes we did on SE Radio, which is 35, 44, which was on debugging, and the host there asked the guest how debugging differs based on language paradigms, whether you are debugging a monolith versus a microservice, or just how you use the tools, as you said. So in your experience, maybe could you talk a little bit about that as well?

Mark Williamson 00:04:45 Yeah, there is a lot of variation I’d say. So what I’ve noticed in the field in my experience is there is a common denominator in everybody’s debugging experience and that is putting more print statements in. It’s probably the first thing you do when you are learning to program, and it carries on. And then there’s the grownup version of putting more print statements in, which is structured logging and open telemetry and things like that. I’d say that is common to all languages and all paradigms of programming. When you get into different more advanced tooling, I think there’s often analogs but it’s different. So most languages have a decent debugger and the tool we call a debugger is only one aspect of debugging, but it typically has some core operations. It lets you step through code; it lets you print variables. The way that works is very different depending on your language and your runtime.

Mark Williamson 00:05:42 So interpreted languages tend to need to tackle these problems very differently to compiled languages. And different languages have different takes on these as well. Same goes I think for any kind of mechanical tracing, any performance monitoring. Most of them also have tools that let you time travel debug these days. But again, the exact implementation and the technique can vary depending on what language you’re talking about. One last point is the distributed case where you’ve got multiple processes. I would say that is just hard. One thing about a monolith, it’s a lot of code to understand, but at least it’s all in one place. Once you start having multiple interacting systems, that is another level of complexity to kind of wrangle and manage even though the individual pieces might be simpler.
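To make the core debugger operations mentioned above concrete (stepping through code and printing variables), here is a minimal sketch using Python's standard pdb module; the program and values are purely illustrative.

```python
# buggy_average.py -- a toy program used only to illustrate stepping and printing.
import pdb


def average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)  # raises ZeroDivisionError when values is empty


if __name__ == "__main__":
    pdb.set_trace()      # drop into the interactive debugger at this point
    print(average([]))

# A typical session at the (Pdb) prompt:
#   step       -> step into average()
#   next       -> execute the current line and stop at the next one
#   p values   -> print the 'values' variable (here, an empty list)
#   continue   -> run on until the next breakpoint or the crash
```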

Priyanka Raghavan 00:06:30 So I think now let’s move on to the part about using AI for debugging. It’s been about two years now that we’ve been using a lot of LLMs to generate code, and once the code is produced, you need to debug it. So in your opinion, how can AI be used to debug? Is it promising? What’s the message from the field and from the work that you’ve been doing?

Mark Williamson 00:06:58 My sense is that using LLMs for debugging is quite early. The thing is, you mentioned two years; in some ways it feels like they’ve been with us forever at this point, and in other ways every week there’s a new announcement or a new change. So it’s literally quite early in the field of debugging, which goes back to, I guess, the first computers. That said, there are lots of places where it looks like LLMs should be good. So by LLM I mean large language model-based AI, the predominant implementation at the moment. I would say everything that they can help with in debugging boils down to one of two things. I reserve the right to change my mind on this, but right now I’d say one is sifting. So you’ve got large quantities of maybe partially structured data, and it’s riffling through it and finding the right bits, the nuggets that you need to know about, or identifying patterns. And the other thing is automation. So the ability to accomplish a set of tasks that would otherwise require toil from you and distract you from the productive work of understanding what’s going on. Maybe there’s a world where the LLM fully solves the bug for you, but I think an important thing to remember with all of this is that it’s great if they can sometimes do a tedious task end-to-end for you, but they’re tools and they’re assistants. So what we really need to ask is: how can they help?

Priyanka Raghavan 00:08:33 So let’s explore the part of the AI being a debugging assistant. And here I wanted to ask you, in your opinion, is it more useful for beginners in a programming language who need guidance to use these assistants? Or is it also good for experienced developers or senior engineers to accelerate complex investigations?

Mark Williamson 00:08:56 I would almost always answer ‘both’ to the question of whether a tool can help beginners and experts. They potentially help in different ways. What I’d say is that beginners, maybe beginners at programming, or you can be a beginner any time you start a new job or move to a new team, can benefit from tools that help them manage the complexity of everything they’re seeing and understand nuances of the code base, or little details that they haven’t appreciated or haven’t had time to absorb yet. So I think there’s a lot of potential there for legitimate help. There is also a shallower kind of help, which is also important to everyone at some point, which is just: answer how to do this, or please run this tool. For me, I can’t be bothered to figure out how to write the bash script. That’s also valid, I think, for people at any level of expertise.

Mark Williamson 00:09:51 If you’re an expert, I think, say an expert in programming generally or in your chosen domain, or maybe an expert in a code base, I think that’s still helpful for you. Some of it will be the same; you are a beginner every time you go to a new part of the code base. Some of it will be different, so potentially you’d be using it for more sophisticated questions or more sophisticated automations. The other dimension in this is: are you an expert at prompting? Because all of these LLMs thrive on correct, high-quality context, and a big part of that is asking them the right question with the right details included so they can give you a good answer. So there’s this extra dimension of, if you can be good at that, then you can be better at everything else.

Priyanka Raghavan 00:10:39 Well, I really like the line where you said you could be a beginner when you are looking at a new piece of code, even as an experienced person. Yeah. I’ll go on to this next question, which is based on the answer you just gave. If you look at vibe coding right now, where it’s generating large chunks of code, I’ve been using it a lot recently for producing some user interface code, which is not my area of expertise. And one of the areas which I found very useful was, since this generates a lot of code and then I run into some problems, sometimes I copy-paste the error messages into my VS Code screen and then I ask the LLM to tell me what might be the cause of this error. And I’ve seen it’s pretty good right now. I don’t really need to go to Google or Stack Overflow to find this information; I’m using my coding assistant to help me with that. In a similar way, I guess for debugging, would it also make sense that you can copy-paste an error or some other things from the call log and it can help you trace out what the problem is?

Mark Williamson 00:11:45 Yes, I think so. Where I first found that LLMs were particularly useful in my development flow is effectively as a better search for certain kinds of things, a much superior search to what I could do with Google, and the kind of problems I found it applied best to are where I want to search not just on keywords but on the meaning of the keywords and the context of the keywords. So my previous approach to that had to be to hope that somebody had put all the keywords together with the relevant context on Stack Overflow and that my Google search would find it. Often now I can ask an LLM and I can include a lot of semantic information, so I can say this is what I’m trying to achieve, or this is what I believe the code is doing, this is the message that I’m dealing with.

Mark Williamson 00:12:37 Please give me the relevant information for that case. And because at least in my very crude understanding of LLMs, they are translating all the tokens I gave into some sort of high dimensional meaning space, they can find the thing which means what I meant very, very effectively. So yes, I think they are potentially fantastic for that sort of thing. Once you bring in coding agents and the ability to act on your system and act on your code base as well, they have the ability to search your code base for relevant information and populate that context window with other stuff and then that’s maybe another dimension to debugging more effectively.

Priyanka Raghavan 00:13:15 Okay. So the searching is one angle where your debugging assistant can help. The other angle, which I wanted to ask you was LLMs and coding agents are now being used to generate a lot of test cases. Could this also be used for debugging assistance? And here I have an example where suppose I have a null pointer exception in a service running in production. Would an LLM assisted test case help me narrow down the cause?

Mark Williamson 00:13:45 I think so. The challenge with test cases is often getting them written at all. A lot of developers nowadays, I think, appreciate test-driven development, and so I think the situation for tests is a lot better than it was. But still, it is a truism that things are under-tested and tests are not written when they should be. The discipline is important. So I think the first thing that LLMs might do is help us populate our tests sooner so that these problems don’t get out there. But certainly once you’ve got something in production, you’ve got an issue you need to replicate. It feels intuitively reasonable that LLMs could get involved all the way along. So I would think this is more of a continuum, probably. You might use the LLM to help write a test case in the first place and try to provoke a bug, or, also usefully, write tests for things you suspect are the problem and rule them out.

Mark Williamson 00:14:44 But even then, once you’ve maybe managed to replicate something, you’ve still got to understand it, and at that point what you need is your full suite of tooling. So there’s log analysis again, there’s performance analysis, there’s debuggers. Another thing then that LLMs could do for you there is help bring together all of those tools. And by the way, I’d say testing is another part of the large definition of debugging I gave earlier, because writing tests helps you understand the what and why of the code. I think there’s two sides. AI can make things a lot worse for us in the sense that you don’t need 30 years of development and a team of a thousand people to make a legacy code base; you can vibe it now. But it can also make things a lot better by taking away toil, but also by giving us a smoother transition between tools.

Mark Williamson 00:15:39 So LLMs are very eager to use tools these days and they don’t have the psychological barriers to learning them that humans do. So maybe in this case you could say, well, LLM, I have a problem here. Please write a test case to replicate it. Once it’s done that, it deploys maybe some more logging into production for you, and you isolate the problem more closely by analyzing the logs that come out. Then perhaps you have to investigate that in more detail outside production, maybe using a debugger or more detailed logging; again, you allow the LLM to iterate on it. So it can potentially help you all the way through this flow and take away a load of things that individually would’ve been distractions from the task of understanding.
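As a concrete sketch of the “write a test to replicate it” step, here is what an LLM-suggested regression test might look like for the null-pointer-style scenario mentioned earlier; the lookup_user function and the failure mode are hypothetical, invented only for illustration.

```python
# test_lookup_user.py -- hypothetical regression test sketched to pin down a
# NoneType failure (the Python analogue of a null pointer exception).
import pytest


# Imagine this is the production function under suspicion.
def lookup_user(db, user_id):
    record = db.get(user_id)       # returns None when the id is missing
    return record["name"].upper()  # raises TypeError when record is None


def test_missing_user_reproduces_crash():
    # Hypothesis: the production crash happens when the id is absent from the store.
    empty_db = {}
    with pytest.raises(TypeError):
        lookup_user(empty_db, user_id=42)


def test_present_user_is_fine():
    # Rule out the happy path so the failing case is isolated.
    db = {42: {"name": "ada"}}
    assert lookup_user(db, 42) == "ADA"
```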

Priyanka Raghavan 00:16:23 I like that. It’s great. So you can almost have an LLM as your interface between the different tools; it helps you find stuff and then pipes it back to another tool, and it helps with the understanding of the debugging problem that you’re trying to solve. So let’s talk about debugging strategies. Is that something that your AI debugging assistant can help with?

Mark Williamson 00:16:48 I think yes. And one very simple way we found they can help is that they’re a better rubber duck. In our office, we’ve got some rubber ducks that lie around, and developers have used those and tried to explain an issue to them and in the process solved it. Imagine if the rubber duck had a lot of software engineering expertise as well; you could just bounce ideas back and forth. So I think that’s the first step: just giving you ideas that you wouldn’t have thought of on your own. It doesn’t have to solve the thing for you. Then identifying different possible tools and different ways of applying things is another one. In the maybe slightly longer term, as people are using LLM-based agents more as well, we touched on how the AI could be your core interface to things, and I think one of the challenges in a debugging strategy is staying in your flow state.

Mark Williamson 00:17:49 So there’s a reason that people love logging, and it’s because it’s programming and they’re already doing that. So you’ve just written some code, you want to know what it did. You can type in a few more lines; if you’ve got a big C++ code base, it might take a few hours to rebuild. You go and have a sword fight on an office chair, but you’re still programming, you’re still in your flow state. I think a potentially very valuable thing AI agents could do is, once your flow state is a conversation with the agent, transitioning much more seamlessly into other debugging strategies. So instead of you having to get your head out of coding space and think about perf or think about GDB or think about whatever your logging framework is, even if it’s complicated, just say, okay, what should I do next? Please look at the performance logs. Please gather a time travel recording and correlate them for me. And you stay in your flow, you stay in your vibing mindset, rather than having to transition between all of these different command syntaxes and output formats, et cetera.

Priyanka Raghavan 00:18:53 That’s interesting. So a way to maintain your context, or you don’t have to do so much context switching, right?

Mark Williamson 00:18:59 Exactly.

Priyanka Raghavan 00:19:01 That’s great. Since I’ve got you on the show, I had to ask this question. Kernel level bugs are supposed to be very difficult to fix. Can a debug assistant help with this?

Mark Williamson 00:19:13 I think yes. In my programming and kernel-level stuff, it’s mostly been on the Linux kernel or on Linux-derived kernel-level code. And I’ve not yet tried applying an LLM to that. But my expectation would be that it would be a very good experience, because the LLM has potentially encountered parts of the Linux code in its training dataset. It certainly will have encountered documentation about it, mailing list discussions, et cetera. So it will know context about kernel code that I do not, or that isn’t in my head right now. And then it’s also very good at understanding a big complex code base, which of course kernels typically are. So I can see it being very helpful from that side, maybe even for generating some of the code if you can get it to understand the right rules. There are a lot of written and unwritten rules in kernel programming, but if you can get those in place, I think it could be very useful there.

Mark Williamson 00:20:14 The thing that I’m not aware of anyone having tried is trying to automate your debugging flow. So possibly within easy reach would be add some logging statements, rebuild and reboot some remote machine and then see what comes out. I think you could do that. The really spicy thing I think could be hooking up an LLM to a kernel mode debugger and having it step through the kernel code on another machine. I really haven’t heard of anyone doing that. I’d love to find out if anyone has because that sounds, well, it sounds awesome. It also sounds like an absolute nightmare to manage. So I’d be very interested to see what they could do there, but eventually I imagine that’s what it’ll be like.

Priyanka Raghavan 00:20:57 So now that we’ve looked at that, I wanted to ask you another question. When we talk about LLM use cases that we’ve seen in the literature and also in blog posts, even for debugging, the languages are predominantly Python, JavaScript, or Java. I’ve never seen that much about C and C++. What is your experience with using AI assistants for coding, as well as, say, debugging C and C++ code?

Mark Williamson 00:21:29 These days a lot of my coding is in Python and in sort of the glue levels above these low-level systems. So I’ve been using coding assistants in various forms to help me with that, and I’ve found it very useful. One of the advantages I believe that LLMs have for languages that are perhaps more modern and perhaps more scripting-oriented is that there’s a lot of code out in the public they can be trained on. So they are excellent at understanding those languages. The flip side is that I’ve also heard, I haven’t experienced this myself, that they can get a bit muddled with what is valid code in a dynamic language. So in languages like JavaScript and Python, you don’t have the guardrails of the compiler telling you no, don’t do that, that’s wrong, when you do something bad with the type system. And that’s potentially a weakness.

Mark Williamson 00:22:28 So the nice thing, I guess, for compiled languages like C and C++ is that you do have the compiler there to give the LLM a telling off and say no, you can’t do that, that doesn’t type check, try it again. And it gives the LLM some guardrails, which is always good. I think one of the things they need is to be grounded in some sort of truth about the system so they can keep being pulled back to that rather than hallucinating. And the other thing they need is good quality context about what they’re doing and what’s going on right now. So in terms of the context, my experience is that coding agents were already pretty good at finding that context in, I suspect, any language’s code base. They know how to navigate different programming languages; they know how to navigate a project structure, and there’s enough C and C++ out there that they are decently good at generating it and understanding it as well. I guess it’s possible that there are some shortcomings I haven’t seen yet, but certainly they seem effective from everything I’ve tried. The only thing I’d say is that C and C++ tend to also be associated with big scary legacy code bases, and they tend to have very unfortunate patterns of bugs, and they tend not to have standard logging frameworks. And so it does create a load of challenges you might not see in other languages.

Priyanka Raghavan 00:23:53 Like heap errors and memory buffer overflows and all that good stuff. Yeah.

Mark Williamson 00:24:00 Exactly, yes. So, there are fewer rules in C and C++ compared to what you can rely on in other programming languages. It’s part of what makes it fun and part of what makes it effective for kernel-level programming, but it is a double-edged sword.

Priyanka Raghavan 00:24:15 Okay. So let’s now go into some of the tooling for debugging, and one of the things that you pointed me to when I was researching for this show was this tool called ChatDBG. I don’t know, is it Chat Debugger or ChatDBG? What is it? Maybe could you explain that to our listeners?

Mark Williamson 00:24:32 Sure. So ChatDBG was originally a research paper; the title of the paper is Augmenting Debugging with Large Language Models, and it’s out of the University of Massachusetts Amherst, AWS, and Williams College. What they did was hook up various software debuggers, the conventional forward-stepping, variable-printing debuggers we’ve all used at some point, to LLMs, and I think back when they published this originally they were using the then-new tool calling abilities of LLMs. So one of the things that’s become, I think, quite revolutionary in AI in the last year or two is the ability for the AI to call external tools, and that gives the AI the ability to populate its own context window with relevant things and to access the ground truth about the outside world. So what they’ve done is they’ve said, well, what if the LLM had access to a software debugger? Now it can monitor the behavior of the code using that software debugger and gain deeper insights into it. And moreover, what if we then say the user can ask questions not about how to run the debugger but about the actual behavior of the program itself? So eventually you can just ask, to take one of their examples, ‘why is x null here?’ So it’s natural language, which is nice, but it’s also a higher-level kind of question, and instead of having to compose the operations required in the debugger to answer the question, you just say the thing you want to know; it’s almost more like a query than operating an interactive tool now.
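To illustrate the kind of question this approach lets you pose, here is a small toy failure with the natural-language query shown as a comment; the code and the prompt wording are illustrative and not taken from the ChatDBG paper.

```python
# A toy failure of the kind discussed above: x ends up None and the caller crashes.
def find_config(paths):
    for p in paths:
        if p.endswith(".yaml"):
            return p
    # Falls through and implicitly returns None when no .yaml path exists.


def load(paths):
    x = find_config(paths)
    return open(x)  # TypeError here when x is None


# With a ChatDBG-style assistant attached to the debugger, instead of composing
# breakpoints and print commands by hand, you would ask something like:
#   "why is x None here?"
# and the assistant inspects the stack and program state through the debugger's
# tool interface to explain that find_config() returned None because no path
# matched the .yaml filter. (The prompt wording here is illustrative.)
if __name__ == "__main__":
    load(["settings.json", "notes.txt"])
```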

Priyanka Raghavan 00:26:27 Well that’s great. So it’s like, when you are debugging, in your call stack you can pose a question in natural language through the LLM and then it’ll find the answer and reply back to you, right?

Mark Williamson 00:26:39 Exactly. Yes.

Priyanka Raghavan 00:26:41 Okay, that’s cool. I think it’s something that we would all be looking forward to. And taking off from there, one of the questions I had, based on the previous answers where you said it’s possible for the AI to go between languages and almost also between different tooling, right? If you had a very large system that’s built on different components, like what we typically have nowadays, we have some scripting in Python, we have a backend in Java, we have a frontend in Vue.js or React or whatever. And usually a bug kind of spans all these boundaries. Do you think something like this ChatDBG could help us track bugs across multiple languages and then show us a recommended approach to fix the problem in an affected module?

Mark Williamson 00:27:29 I think that would be very interesting. I’m not aware of any current AI agent that will combine all of the complicated parts of that: the multiple languages, the distributed nature of it, complex interactions. ChatDBG, I think, has several different debugger backends, so you could maybe imagine it talking to a C component and to a Python component and to some other components. The challenge for debugging a distributed system, though, is also that you need to allow it to run. So using live debuggers, that step can be difficult in a distributed system even when you’ve solved the problems of: can I cover all the languages that I need, can I understand the interactions? Because if you stop one of them, then timeouts can happen, or you can radically change the order in which things happen. So it’s a challenging area.

Mark Williamson 00:28:23 I also suspect that for quite a while, doing this well, this sort of varied problem, is still going to need human guidance as well, because there are a lot of different things you need the LLM to be smart about, and my general experience has been it’s best to give it one thing to be smart about at once. Trying to get it to balance lots of different tasks from lots of different sources without some guardrails is challenging. So your AIs need guardrails, either from your system or from the human, and I think it’s going to be a case of both of those for some time to come.

Priyanka Raghavan 00:29:02 Yeah, so I think that is a bit of a complicated use case but it’s maybe something that will be solved in the future. I’ll move on to the next question which is I wanted to know about time travel debugging. What is it?

Mark Williamson 00:29:13 Time travel debugging. It’s a vision for how you should debug software, first of all. And the vision is that you shouldn’t have to pick and choose what information you get like you do with logging; you should have it all by default. So what time travel debuggers have in common is the ability to record everything your program did and then replay it deterministically. Typically they can do that in reverse as well, so you can rewind, which I’ll come back to. The trick with time travel debugging is making it efficient. Modern time travel debugging systems are very efficient; they don’t need to single-step the program and record every instruction that ran anymore. That would be very bad; that would be higher overhead than detailed logging. What they do instead is use a variety of lower-level tricks in the system to capture only what affects the non-deterministic behaviors at execution time and then replay just those.

Mark Williamson 00:30:12 And you can recompute every intermediate state, so it means every memory location at every machine instruction that ran is available to you now, and what you need to do is then select the variables you want, and that’s where the reverse execution comes in. So I like to say that normal debuggers tell you ‘what’, because they’re like a microscope: they let you inspect all of the state in your program and understand exactly what is going on right now. A time travel debugger gives you access to causality, I guess. So you can say, how did we get here? And that means taking you from what to why. So the real big benefit is to be able to query backwards in time and say, well, how did this value get set in the past? How did we get into this function call, and why did we get in now? So it’s a very broad set of data, almost in a way the broadest set of data you could have about your program, and you query it to answer questions all the way from conventional debugging problems to performance problems to stuff that you might otherwise have used logging for but that needs a rebuild.

Priyanka Raghavan 00:31:22 Okay great. So does it work with the trace that we usually use like our logs and traces? Does it work with that?

Mark Williamson 00:31:30 Time travel debugging systems usually work at a lower level than that. There are some time-travel-like systems which use something like trace data to reconstruct state, but the trouble is, in those systems you can only reconstruct what was traced, and that’s often not everything. So time travel debugging systems tend to be implemented at a lower level, either at the level of the programming language’s runtime, particularly for interpreted languages, or as some sort of just-in-time recompilation for native languages. So they tend to sit under the level of your code, and that’s what gives them the power to inspect and capture everything the program does efficiently. What you can do is combine techniques. So potentially you could take a time travel debugger recording and extract the same information you usually would’ve got from tracing.

Priyanka Raghavan 00:32:25 Is there a lot of plumbing that needs to be done to support this or?

Mark Williamson 00:32:30 Typically no, the integrations with time travel debuggers are very simple, and I’d say it’s for similar reasons to the phenomenon where you say, well, I want to run my code in a virtual machine now, or I want to run my code in a container now, and you just lift it up and put it there and it works. The fact that the integration of a time travel debugging system is below the level of your code means you don’t explicitly need to change anything. You just add an extra layer into the system, and you get that extra visibility.

Priyanka Raghavan 00:33:03 Okay, interesting. So here’s another question, because I’m a bit fascinated with this: does it keep track of things at, say, the register level, like what gets written to the registers, something like that, or a bit higher?

Mark Williamson 00:33:15 For time travel debugging systems that work at the machine instruction level, yes, it’s register-level state and memory level. But the important thing is that tracking all of that would be horrible. Tracking your register state for every machine instruction would be a nightmare. So what they do in practice, and this is true across a variety of systems, is capture what the starting state of your program was at a low level, so the registers and memory, and what information got into your program from the outside world; then everything else can be recomputed. And there’s a load of clever tricks you do to make the recomputation efficient, because you don’t want to replay everything you recorded every time you want to ask a question, but fundamentally you only need to know what influenced the runtime, because modern CPUs are immensely good at rerunning deterministic code very, very quickly. You don’t need to be capturing all of that stuff, and it’s lower overhead not to. So it’s smoke and mirrors; we call it time travel debugging, but the real technique underneath the hood is deterministic record and replay, and then everything else is kind of magic tricks to provide a better user interface so that it looks like a debugger, or it looks like a logging system, or it looks like a tool an AI agent can use.
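A heavily simplified sketch of the deterministic record-and-replay principle described above: capture only the non-deterministic inputs at record time, then recompute everything else at replay time. Real time travel debuggers do this at the machine-instruction level with many optimizations; this Python toy only illustrates the idea.

```python
import random

# Toy "record and replay": the only non-deterministic input here is the random
# number source, so that is all we need to capture. Everything else is
# recomputed deterministically, which is why replay can reconstruct every
# intermediate state without having logged it.


def program(next_random):
    state = 0
    history = []
    for _ in range(5):
        state += next_random()   # non-deterministic influence from "outside"
        state *= 2               # deterministic computation, never recorded
        history.append(state)
    return history


def record():
    log = []

    def recorded_random():
        value = random.randint(1, 10)
        log.append(value)        # capture what came in from the outside world
        return value

    return log, program(recorded_random)


def replay(log):
    it = iter(log)
    return program(lambda: next(it))  # feed the captured inputs back in


if __name__ == "__main__":
    log, original = record()
    assert replay(log) == original    # every intermediate state is recomputable
```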

Priyanka Raghavan 00:34:37 That’s great. I would like to end this section off by asking you a question based on something I saw on the Undo website: would you be able to point me to what caused a crash in code written 15 years back, where the developer who wrote it has left the company? Could time travel debugging help with this kind of problem?

Mark Williamson 00:34:59 Absolutely. So I think the reason that that’s a good example is because it’s legacy code. It’s a huge system and it’s something that just starts to happen particularly in big organizations when they’ve been developing for a while and it happens even in the most or perhaps especially the most mission critical, important code bases people have. Because over time you have thousands of people work on these, they work in different generations of programming languages, different paradigms and there’s lots of domain specific expertise. And as we said earlier, any time you go into code that you didn’t write, you’re a beginner again, particularly if it’s a large body of work. So the reason that time travel debugging helps in these cases is it allows you to see the causality, so you don’t have to understand your 10-million-line code base in detail to infer how a bug happened.

Mark Williamson 00:35:58 Instead you can rewind through it. So you could say, well, this value was bad, why was it bad? Rewind to where it last changed. Oh okay, I didn’t expect to be in that code path, why were we there? And so you can rewind again and find why the decisions were taken there. What it means is that a lot of the domain-specific knowledge that you might have needed to ask your colleague who left 15 years ago can be recaptured by understanding what really happened. And stuff you didn’t need to know, like theories you had about the bug that were wrong, you don’t need to worry about anymore, because you can see that those things didn’t happen. The interesting thing here, and it took us a while to realize this even at Undo, is that the problem we’re managing for developers here is very similar to the problem you’re managing for an AI.

Mark Williamson 00:36:49 So it’s: provide them with a ground truth of what really happened in the system, provide them with tools to navigate it, provide them with the right, high-quality, relevant context, and don’t give them irrelevant information because it’ll confuse them. It’s a very, very similar phenomenon to what we all have when we are trying to get good output from AI. It’s just that humans are much better intelligences, so they can cope with smaller amounts of context, they can handle less relevant information and fix it for themselves, and they can ultimately hold a lot more in their head at once.

Priyanka Raghavan 00:37:25 I think it’s pretty fascinating, and I think maybe we will have to add some more show notes on time travel debugging and examples from some of the blogs that I read on Undo. So let me go on to the last portion of the show, where I want to talk a little bit about autonomous agents for debugging and what exactly we mean here. I want your take, because when I think about autonomous agents for debugging, it appears to me like there’s an agent which does the debugging, which automatically creates the breakpoints, which steps through the code, finds the issue, and somehow magically displays that on the screen to me. What is your take on an autonomous agent for debugging?

Mark Williamson 00:38:10 So first of all, I’d like to define what I would say an AI agent is, and it’s something that can act on its own, independently of you. It can decide to tackle certain tasks or run certain tools and then adapt to their responses in pursuit of a wider goal. And it’s doing that autonomously but on your behalf, so it is acting for you; it is, in some ways, your agent. The most common kind of agent we as developers see is the coding agent. These have sort of evolved from what we called coding assistants, which were an incredibly powerful but glorified autocomplete, into something that can accomplish software engineering tasks on its own. That’s broadly, I think, where things are starting for debugging. As we’ve said, debugging is a lot of what coding is, and coding agents have taken that on board as well.

Mark Williamson 00:39:11 They can do debugging, but it’s fairly early days. The interesting thing I’ve seen is using a coding agent, Claude Code for instance. I’ve tried to debug a sample problem in the past, and, spoilers, I was trying to get it to use a debugger because I thought time travel debugging would help. But early on, what I saw it do was edit my code, add a load of printfs in places it thought were interesting, and ask for permission to recompile it. And I mean, if it can choose the right places to put the printfs, that’s potentially useful. Again, if you have a compilation time that is seconds or minutes rather than hours, it’s potentially useful. But it did remind me of a Terminator chasing a woolly mammoth and trying to bonk it on the head with a bone or something. It was this weird juxtaposition of a very sophisticated modern tool and then pretty much the oldest debugging tool we have. Potentially, though, I think we’ll see this transition towards more sophisticated agentic debugging, debugging by agents, and coding agents are going to be the first place we see that, thanks in large part to this thing MCP, the Model Context Protocol, which was developed originally by Anthropic and has taken off all over the place.

Mark Williamson 00:40:32 What it amounts to, I would say, because I spent a lot of time trying to puzzle out how it fits into the system, is a plugin architecture, no more, no less. It plugs tools into whatever your local LLM client is, and there’s no reason those tools can’t include a debugger or a performance profiler or something else. The real trick with these tools, though, is how you get the AI agent to be good at using them. And that’s partly a design challenge for people like me: how do we make our debug tooling work well with what an LLM agent needs? And it’s partly for the AI companies as well, to train better tool use into their products and more broad awareness of tools, better interaction with the MCP protocol and other tool-use protocols. And what I’d expect we’ll see is coding agents getting better and better, and then potentially specialized agents for debugging certain kinds of problems as well, because there’s a different kind of knowledge and flow involved in debugging. You mentioned selecting a debugging strategy earlier; you could imagine a hierarchical collection of these things where maybe your coding agent spits out code and then farms out ‘I’ve got this problem, how do we solve it?’ to a specialized AI that tries different strategies and different tools, aggregates the information together as feedback, and then the coding agent acts on that, makes some code changes, and we try again.
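As a rough sketch of the plugin idea, here is roughly what exposing a debugging tool over MCP could look like using the FastMCP helper from the Model Context Protocol Python SDK; the tool itself is a stub, and the exact SDK surface shown is an assumption that may differ between versions.

```python
# Hypothetical MCP server exposing a single "read_backtrace" tool to an LLM
# client. Assumes the FastMCP helper from the MCP Python SDK (pip install mcp);
# treat the API details as illustrative rather than definitive.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("debug-tools")


@mcp.tool()
def read_backtrace(pid: int) -> str:
    """Return a textual backtrace for the given process id.

    In a real integration this would drive a debugger (GDB, a time travel
    debugger, and so on); here it is a stub so the sketch stays self-contained.
    """
    return f"(stub) backtrace for pid {pid}: main -> handle_request -> parse"


if __name__ == "__main__":
    # An agent configured to use this server discovers the tool and decides
    # when to call it on the user's behalf.
    mcp.run()
```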

Priyanka Raghavan 00:42:05 Yeah, I like that. I think we’re still not there, based on all of the conversations we’ve had so far, but I still had to ask you this. Do you think the future is something like: if you have a performance issue which users are reporting, but you don’t really see anything in your traces or logs with respect to a performance issue, and it’s maybe caused by a third-party integration, could the different debugging agents, like you said, a master debugging agent and a lot of mini debugging agents doing different things, orchestrate and find out the issue? Because this typically happens where the user reports slow performance, but we have nothing in the logs or any indication in the traces to show that a particular service is acting badly, and then you find out it’s not your service but a third-party integration.

Mark Williamson 00:42:59 I think so. This is probably not possible yet, but we’re already seeing glimpses of it. And I think one thing which is worth bearing in mind is that there’s a perfect rose-colored-spectacles world where the AIs solve all of our problems and can solve these things end to end, but there’s huge value to be had in getting them to do the boring 80% of the work, to take the toil off so we can focus. So even if we can’t solve the whole thing, having the LLM act as your agent, go out, and gather the information you need to make the next decision is still hugely valuable. I think the trick to debugging issues is how you make the AI as smart as it can be. And the challenge for an AI debugging agent is that you have to get the right context fed in there, just like a human developer, but more so. They need to know as much as possible about what’s going on, because otherwise they won’t be able to answer questions or direct their investigations, and if they don’t know stuff, they also tend to hallucinate.

Mark Williamson 00:44:06 And that’s something I think we’ve all seen at this point. Sometimes it’s very amusing, but you don’t want it happening in the middle of a production issue and sending you on a wild goose chase. So for this you need the right ability to gather the information, and you need that information to be solid. So in the kind of scenario you described, I’d imagine this is probably a tiered approach. Often debugging challenges want a range of tools. So you might start with your inputs being basic performance-level monitoring and user input. So user feedback is valuable as well here. Once you’ve started investigating, I’d imagine you’d go down a chain of increasingly sophisticated debug approaches. So you’d initially look at your tracing, and you might well automate that and say, okay, LLM, when a performance alert goes off, look at the traces, see if there’s anything weird.

Mark Williamson 00:44:59 If there’s not, then you’ve got a choice, I guess: you can go and look as a human, or you can say, okay, go and do the next phase. And the next phase in that world would probably be something like profiling or some lightweight capture of more detailed tracing. But if you’ve got a complex problem with many moving parts, or maybe you’ve got legacy parts of the system, maybe that’s not enough either. So at that point you might move up to two potential approaches. One is to write tests and try to replicate it outside production; that may or may not work. Or for bugs that are extremely complicated, that you can’t replicate, that only happen in that one place, that’s where I’d say something like a time travel debugging system, with its capability to fully record and capture the interactions between different services as well, would be really valuable.

Mark Williamson 00:45:49 So I think the LLM can help with individual stages, but ultimately the challenge we are facing at each stage is how do we make the LLM for this part of the task as smart as possible. So that’s down to prompting, giving it the right details, giving it the ground truth about what it’s reasoning about, and giving it the right context. And the ultimate of that is when you get up to the full logs that come out of time travel debugging, which give you the ability to verify what went through the system as well and why things happened. So the LLM’s got the power of that, but you’ve got the power as well to go through and check it’s working.
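A toy sketch of the tiered escalation Mark describes, where each stage is tried in order and the investigation only escalates when the cheaper stage comes back empty; the stage functions are placeholders rather than real integrations.

```python
# Hypothetical escalation ladder for a performance investigation. Each stage is
# a placeholder returning either a finding (a string) or None, meaning
# "nothing conclusive, escalate to the next, more expensive technique".


def check_traces(alert):
    return None   # e.g. ask an LLM to scan tracing output for anomalies


def run_profiler(alert):
    return None   # e.g. lightweight profiling or more detailed tracing


def write_repro_tests(alert):
    return None   # try to replicate the issue outside production


def time_travel_capture(alert):
    return "third-party call blocking the request path"   # heaviest, last resort


STAGES = [check_traces, run_profiler, write_repro_tests, time_travel_capture]


def investigate(alert):
    for stage in STAGES:
        finding = stage(alert)
        if finding is not None:
            return f"{stage.__name__}: {finding}"
    return "unresolved: needs a human"


if __name__ == "__main__":
    print(investigate({"symptom": "slow responses reported by users"}))
```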

Priyanka Raghavan 00:46:28 I think that makes a lot of sense, the answer that you gave, that it has to be a layered approach. So let’s move on to the next question I wanted to ask you. One of the worries with autonomous agents is introducing regressions or security vulnerabilities, or maybe masking the real root cause. The reason I ask this question is that recently I remember seeing this thread on Twitter, or X, which I’m sure a lot of you also saw, about a database at one of the companies where a lot of records were deleted. The autonomous agent had inserted rows into the table and then deleted a lot of them, and when the agent was probed about this deletion, it lied about it and came up with some fake records as well.

Priyanka Raghavan 00:47:20 So this is a cause for concern, obviously. One of the things the team did in that case, I think, is that they added a lot of guardrails around what the agent could do and how much access it had, things like that. Now, when you are looking at these autonomous agents in the debugging context, where we are trying to solve a problem, we could run into similar issues, right? How much do you believe the agent? There’s a certain level of trust, but I had to pose this question to you: what do you need to do to validate that what the agent is giving you is right?

Mark Williamson 00:47:54 Sure, it’s an interesting one, because yes, there are so many nightmare scenarios out there where you see somebody who shared conversations like: why is the database empty, why did you delete it? And the LLM says, you are right, you did tell me specifically not to delete the entire database; next time I will ensure that doesn’t happen. There’s lots of opportunity for unexpected behavior still. Ultimately the AI model vendors, I think, do a lot of work to try and mitigate this stuff. They do a lot of work with reinforcement learning to try and align the AI with: don’t lie to the user, don’t do inadvisable things, follow instructions carefully. But the problem with them right now is that it’s not exactly that they’re lying; they don’t even know they’re lying. They know the thing they do, and the thing they do is try to provide you a good answer.

Mark Williamson 00:48:52 And there are many parts to a good answer. One of them is having an authoritative and polite tone, and another is using the correct terminology for your domain. Another is citing specific examples from your source code, and another is being based in truth, and they’ll choose as many of those as they can to get you to a good answer. But any of them might get dropped, and one of the hard ones to keep is the truth. So that is quite likely to be a casualty if there’s not enough information. As we said earlier, I think guardrails are very important, and there are two ways you can interpret the rails as well. There are the rails which stop you tripping over somewhere you shouldn’t, the safety rails, so that will be things like controls on what operations the AI can do.

Mark Williamson 00:49:40 The other is more like train tracks, not in the sense of exactly controlling it, but in the sense of choosing desirable paths, so providing the right information to them. So I guess if we look at the context of introducing security vulnerabilities, let’s say, you might have a guardrail which is certain kinds of security scanner that run automatically as static checks. So you’re providing that feedback path, and that feedback path to an agent is very important because it’s how it learns about the world you’ve put it in. In terms of regressions, I’m afraid the answer there is going to be testing, as it always is. Better development practices help as well, though, and that includes better development practices for the AI. So any static checks you can do will help; turning on all of your compiler warnings will help. And also anything you can do to help it understand the real context.

Mark Williamson 00:50:38 So we talked about ChatDBG; there’s another interesting project called LDB, which I think is LLM Debugger, and that’s written about in a paper from the University of California San Diego called Debug Like a Human, and the subtitle there is a large language model debugger via verifying runtime execution step by step. And they showed something really interesting, which is that they gave an LLM that was being used as a coding agent the ability to step through code it had just generated and see if it did what it expected, or if it had violated invariants it expected to be there. And what they’ve shown is that giving a coding agent better insight into how the code it wrote behaves dynamically can make it smarter. So I think there’s a whole world here, again, in providing better kinds of context and better kinds of ground truth to an AI system, because ultimately, if you get that right, the AIs become even smarter than they already are.
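As a rough illustration of the runtime-verification idea, here is a toy tracer built on Python's sys.settrace that records local variables at each executed line of a generated function, the kind of step-by-step evidence that could be fed back to an LLM; it is a sketch of the general technique, not the LDB implementation.

```python
import sys

# Toy runtime tracer: records local variable values at each executed line of a
# target function, producing step-by-step evidence of what the code actually did.


def trace_execution(func, *args):
    events = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, events


# A deliberately buggy "generated" function: the running total is never accumulated.
def generated_sum(values):
    total = 0
    for v in values:
        total = v        # bug: should be total += v
    return total


if __name__ == "__main__":
    result, events = trace_execution(generated_sum, [1, 2, 3])
    for lineno, local_vars in events:
        print(lineno, local_vars)
    print("result:", result)   # 3, which a verification step would flag as wrong
```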

Mark Williamson 00:51:43 They’re already very good at coding. But if you can point them in the right direction and give them the things they really need to know, you can unlock more of that capability, and you can be using their intelligence for the right things, which is writing the code, instead of the wrong things, which is puzzling through gaps in the data that you could easily get for them. The last thing is masking the real root cause, and I think this applies to regressions and security vulnerabilities as well: I’m afraid people don’t usually like it, but code review, you’ve still got to do it. You’ve still got somebody, maybe with AI assistance as well, but ultimately somebody’s still got to check that the tests don’t now pass simply because the LLM deleted all of them, or that the LLM didn’t put an obvious backdoor into the system in the interest of making something else it thought you wanted possible. So I think there’s got to be, for the foreseeable future, something that looks like our modern software development lifecycle, maybe AI-assisted, but with humans in the loop, humans ultimately responsible for making sure this stuff is right and that the right code is written to match the end user’s requirements.

Priyanka Raghavan 00:52:52 I think that’s great. I think that’s a very valid point: how do you trust an output and verify it, and have some sort of human in the loop to check the validity of the output where possible. I think that brings us to the end of our show. It’s been a fascinating conversation where we went right from treating the debugger as assistant tooling to looking at it being autonomous. So thank you so much for coming on the show, Mark, it’s been great having you.

Mark Williamson 00:53:21 Thank you very much. It’s been great to be here and very fun to talk about my favorite subjects.

Priyanka Raghavan 00:53:24 This is Priyanka Raghavan for Software Engineering Radio. Thanks for listening.

[End of Audio]
