Brian Demers

SE Radio 675: Brian Demers on Observability into the Toolchain

Brian Demers, Developer Advocate at Gradle, speaks with host Giovanni Asproni about the importance of having observability in the toolchain. Such information about build times, compiler warnings, test executions, and any other system used to build the production code can help to reduce defects, increase productivity, and improve the developer experience. During the conversation they touch upon what is possible with today’s tools; the impact on productivity and developer experience; and the impact, both in terms of risks and opportunities, introduced by the use of artificial intelligence.

Brought to you by IEEE Computer Society and IEEE Software magazine.





Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Giovanni Asproni 00:00:18 Welcome to Software Engineering Radio. I'm your host, Giovanni Asproni, and today I'll be discussing observability in the tool chain with Brian Demers. Brian is a Developer Advocate at Gradle, a Java champion, and an Apache member who contributes to the Apache Directory, Maven, and Shiro projects. He spends much of his day contributing to OSS projects by writing code, tutorials, and blogs, and answering questions. Brian, welcome to Software Engineering Radio. Is there anything I missed that you'd like to add?

Brian Demers 00:00:47 Yeah. Well thanks for having me. I think that bio pretty much covers it. I think the only other thing I’ll add is most of my career has kind of been in and around the Java world, so everything I think about sort of skewed a little in that direction. But I think everything we’re going to talk about today applies to just about any developer.

Giovanni Asproni 00:01:02 Okay, so let's start. Now, tool chains in software development — we have several tool chains: compiler tool chain, build tool chain, CI/CD tool chain, lots of tool chains. What tool chains are we referring to here?

Brian Demers 00:01:15 I think the answer is yes, all of them. So I think the issue really is exactly what you mentioned. We have all of these tools, they’re complex, they’re all kind of tied together. Some of them are stitched together better than others. So we really need a better way to understand them.

Giovanni Asproni 00:01:30 Okay, we'll be talking about all the tool chains we'll be using here. Good. Now when we talk about observability in the tool chain, what are the similarities and differences with the observability of systems running in production? Because usually when we talk about observability, we think about the system in production and the various events, logs and things we can get to understand what is happening there. So are there any similarities with this?

Brian Demers 00:01:56 Yeah, absolutely. So if we think about our production applications, you wouldn’t deploy a production application today without any kind of monitoring or anything behind it. And a lot of that is because they’re complex. We don’t really know the load on the system or all of these other things we’re unaware of until that, I don't know, Black Friday event happens or whatever. Our build tools are very complex systems. Our build chains, everything are very complex. You have all of these things happening, downloading dependencies, compiling, linting, unit tests, integration tests, whatever. I mean there’s security scanning, lots of things that can go wrong just like our production applications. So to me it’s kind of mind boggling to think about that we’re just now getting to a point where we’re treating the complexity of our build tools the same way as we would our applications. So if something goes wrong in your build tool, how do you deploy something to production? Well, you can’t. So if your build tools are actually in line, they’re on the path to getting something deployed to production.

Giovanni Asproni 00:02:55 And also, is there any relationship with shifting observability left? This is something that you find a lot nowadays — pretty much feeding observability data from the system in production back to the developers for their day-to-day activities. Is there any relationship between observability in the tool chain and this shift left of the observability of the system?

Brian Demers 00:03:17 Absolutely. I think one of my favorite things to talk about is flaky tests, for example. So flaky tests — I may mention this a few times, but if you haven't ever been aware of flaky tests, it's a test that essentially will pass one minute and then, without any code changes, will fail the next minute, or vice versa; or it works on my machine but not on your machine, it works on CI but not locally — whatever it is, there's some inherent flakiness. So the earlier developers can know about those types of problems, the better they can spend their time. So if I'm working on a feature and there's a flaky test, for example, that has nothing to do with the code that I'm changing, just the fact that I'm aware of that will make it so I don't go hunt down that issue. Now there are a lot of other problems with flaky tests. We can talk about how to fix them and how to monitor them and all that. But essentially that's just one example of how a developer having knowledge earlier in the system is better.

Giovanni Asproni 00:04:08 Okay. And in terms of problems that these observabilities in the tool chain now, what problems are we actually trying to address with this? So you mentioned flaky tests and maybe also security scanning, some of that stuff. But in general, what are we trying to achieve here?

Brian Demers 00:04:23 I think one of the biggest ones that I know at Gradle we talk about a lot is developer feedback cycles. So oftentimes we run into issues where builds are hours long. I know of a lot of shops who don't run their tests locally; they let their CI systems handle everything, so they can't even run their tests. So their feedback cycles are incredibly long and very disconnected. So again, that kind of goes to shifting left. If I can't even run my tests, then there's no way I'm going to be able to be productive. So I'm going to make some code changes, I don't know, throw some salt over my shoulder and then push to CI. Or push to my Git repository, and then CI's going to pick it up and then maybe an hour or two later I'll get the feedback from my tests. So there's no way that I can be productive in that environment.

Brian Demers 00:05:05 But even in the case where I can run tests locally, if my feedback cycle is, let's say, 20 minutes: I run a build, I make some changes, and I run the build again. So whatever that includes — linting, compiling, testing, whatever your build includes, all of those things. If my feedback loop is 20 minutes, well, how many times can I do that a day, right? And some people listening to this may be thinking 20 minutes is fast. But even if you can get that down to two minutes, just think about if you and I were silent for two minutes. People would be checking their phones or whatever to see if this thing is actually working. Or if you were having a conversation with somebody and you waited for two minutes before you said the next word, you'd probably just turn around and walk off. So we really need to think about feedback cycles being as fast as possible, but again, from the perspective of observability, we need to have data to figure out how long those feedback cycles are. And then of course data to know where the bottlenecks are to improve them.

Giovanni Asproni 00:06:04 Okay. So it is really about efficiency and productivity here. Am I correct? These are the kinds of things…

Brian Demers 00:06:10 I think that’s one angle. I think the biggest angle we see is definitely productivity, but I think there’s a lot of other angles too. So just by having data, something goes wrong, you can figure it out. So maybe you can think of efficiency as in troubleshooting efficiency, but also if you don’t know what dependencies you’re using. So you’re just not aware of what is in the system to begin with.

Giovanni Asproni 00:06:31 That is actually quite an interesting one. I think in the Java world probably it’s a problem that is very often presented quite a…

Brian Demers 00:06:38 It's a big problem.

Giovanni Asproni 00:06:39 I remember having class paths with different versions of the same jar in the same class path because people forgot to remove some dependencies and then strange behaviors happening.

Brian Demers 00:06:51 Absolutely. I mean, we still see those today. So that’s a problem that, and it was probably more popular earlier in my career, but we still see echoes of that now where things change packages or namespaces. So you can still end up with basically two of the same things on your class path, which just adds to the complexity. And without a great deal of knowledge about how the system works, it’s really difficult to find unless you have some tooling around the system.

Giovanni Asproni 00:07:16 Okay. And so I guess this aspect of observability is also about developer productivity engineering as well, isn't it?

Brian Demers 00:07:26 Absolutely. Absolutely. So developer productivity engineering or DPE is basically all about engineering our way into making our builds faster or basically the feedback cycles shorter. So I can get more done, I can stay in the flow. As a developer, what I really want to do is write code. So if I’m spending a long period of time not writing code or not waiting for a long process to happen, that’s less time I’m going to be active or excited about whatever I’m working on.

Giovanni Asproni 00:07:54 Okay. So it is not only about productivity, but it's also about better developer experience then. So it's both things.

Brian Demers 00:08:00 I think those two things are very related, depending on who you talk to, depending on how much they like software. But I think if you've been in the industry for more than a few years, you know if you like writing code. And if you don't, you're probably migrating away from banging on the keyboard every day. But if you do enjoy that, then you want to be productive. You want to learn, you want to get the little hits of dopamine every time you get some feedback. So maybe you run a test in your IDE, and it goes green. That little bit of feedback is exciting to your brain, and it makes you happy. And I keep saying that a happy developer is a productive developer. So it's a win all around: a win for me because I'm a happy person, and a win for the company that I work for because I'm more productive — essentially me being productive saves them money or makes them money, depending on how you look at it.

Giovanni Asproni 00:08:47 True. Yeah. Yeah. I can see the usefulness of this solving some of those irritations that you have every day when tools don’t work well, or you have to wait for stuff.

Brian Demers 00:08:57 Absolutely.

Giovanni Asproni 00:08:57 Yeah, I can understand that. I can see it, being a developer myself. And in terms of observability here, what are we observing? So can you give us some examples of what kind of information and data we are looking for here?

Brian Demers 00:09:11 Sure. I think it depends on what your problems are, but essentially you can kind of think about them as, I guess, the original three pillars of observability. So I think starting with logs — logs are the easy ones. Build tools have all kinds of logs, in some cases too many logs. So if you have a million-lines-of-code project, regardless of what your build is doing, that's going to output a lot of logs. So how do you sift through that, right? And it's the same with our applications in production. If you have a lot of logs, you need tools to just cut through them and sort them and whatever. But yeah, then tracing: where are the bottlenecks in your build? In a lot of cases it's testing. We've seen a lot of shops where 80, 90% of the build time is testing.

Brian Demers 00:09:54 So whenever I talk about builds, I include test time. I know not everybody does that because a lot of people like to skip their tests, but I won't go down that path. So finding bottlenecks in your builds. And then metrics like build times — how long are your build times? If build time is the issue, if you're measuring feedback cycles — if you don't measure what those feedback cycles are, how are you going to improve them? Or how do you know when they get worse or better? So how do you report success to your boss who's paying you to improve your feedback times or whatever?

Giovanni Asproni 00:10:24 Yeah. Okay. And in terms of what we can do currently, what kind of observability options do the tools currently offer?

Brian Demers 00:10:32 I think right now there's not a lot out there. I mean, I know our product — so I work at Gradle; we have a product called Develocity in the Java world that obviously specializes in this very thing we're talking about. But in general, I don't think that tooling is very robust in a lot of systems. And that's all the way down to the individual tools themselves. Like compilers, linters — they may not even output a lot of data that you'd want to consume. So I think we're starting to see more robust data come out of these tools. And I think it goes hand in hand with a lot of the supply chain security that's happening too, because we want to know what goes into the system, we want to know what comes out of the system for security reasons, but we can also use that same mechanism to gather information to improve life for everybody.

Giovanni Asproni 00:11:17 And so at the moment, so you said the tools don’t really offer much. So in practice, is there something that we can still do with what we’ve got at the moment in terms of observability in the tool chain?

Brian Demers 00:11:27 Absolutely. Yeah. So I think even without specialized tools, just measuring how long a process takes or individual executions in your pipeline take. You could do that, dump it to a file or dump it to a database somewhere. And that’s a pretty low hanging fruit. It’s not a lot of data, it’s not rich data. But if you don’t have anything that’s a huge improvement. And then at least you have enough data to make decisions on where to invest or where to spend effort in collecting more information.
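
To make that "low-hanging fruit" concrete, here is a minimal sketch in Java of wrapping a build invocation, timing it, and appending the result to a CSV. The `./gradlew build` command and the `build-times.csv` file name are placeholders, not anything prescribed by the tools discussed in the episode.

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Instant;
import java.util.List;

public class BuildTimer {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Command to measure; "./gradlew build" is just a placeholder.
        List<String> command = args.length > 0 ? List.of(args) : List.of("./gradlew", "build");

        long start = System.nanoTime();
        Process process = new ProcessBuilder(command)
                .inheritIO()                 // stream the build's own output to the console
                .start();
        int exitCode = process.waitFor();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Append one record per build: timestamp, command, exit code, duration in milliseconds.
        String record = String.format("%s,%s,%d,%d%n",
                Instant.now(), String.join(" ", command), exitCode, elapsedMillis);
        Files.writeString(Path.of("build-times.csv"), record,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        System.out.printf("Build finished with exit code %d in %d ms%n", exitCode, elapsedMillis);
    }
}
```

Even a file like this, collected over weeks, is enough to decide where to invest in richer instrumentation, which is exactly the point made above.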

Giovanni Asproni 00:11:53 With all these tools — you mentioned compilers, you mentioned tools for security scanning, maybe test runs, and all this stuff — so several tools, different data, potentially an enormous amount of data. I'm thinking especially of logs. I remember well, in the Java world actually, when you compile it's very common to see logs, warnings, all sorts of stuff — lots of information there. How are we supposed to collect this data and make sense of it? With the current tools, you said we don't have much, but how can we make sense of it?

Brian Demers 00:12:27 That's a great point. So let's take the example of compilation — either, I don't know, maybe you want to talk about warnings or deprecation or whatever it is. So if you are running a build, right, and I'm just going to wait two minutes or 10 minutes or an hour for my build to happen, I'm not going to sit and watch those logs go by. And I'm not going to go proactively look for when that happens. After an hour, I'm not going to go back and scroll through every line looking for compiler warnings — I got a green light, it passed. So the real way to fix that is either make the warnings errors, which is kind of the hard, heavy-handed approach, or have some system in place where you're tracking the number of warnings you have now and then you're reducing them. So you have to have some system sitting on top that looks for this. So say you have a hundred compiler warnings today, and by the end of the next three months you want to get it down to 10 or zero or whatever. So you have to track that number. And you have to watch it go down. So you have to have some tooling that collects that information, reports on it, and makes it actionable. Because if nobody looks for that data, nobody's going to act on it.
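
As an illustration of that kind of warning "ratchet", here is a hedged sketch. It assumes warnings show up in a saved build log as lines containing "warning:" (the javac convention) and that the last accepted count lives in a plain-text baseline file; the file names and format are assumptions for the example, not part of any specific tool.

```java
import java.io.IOException;
import java.nio.file.*;

public class WarningRatchet {
    public static void main(String[] args) throws IOException {
        Path log = Path.of(args.length > 0 ? args[0] : "build.log");   // captured build output (assumed)
        Path baselineFile = Path.of("warning-baseline.txt");           // last accepted count (assumed)

        // Count lines that look like javac warnings, e.g. "Foo.java:42: warning: [deprecation] ..."
        long warnings;
        try (var lines = Files.lines(log)) {
            warnings = lines.filter(line -> line.contains("warning:")).count();
        }

        long baseline = Files.exists(baselineFile)
                ? Long.parseLong(Files.readString(baselineFile).trim())
                : Long.MAX_VALUE;

        System.out.printf("Compiler warnings: %d (baseline %s)%n", warnings,
                baseline == Long.MAX_VALUE ? "none recorded" : baseline);

        if (warnings > baseline) {
            System.err.println("Warning count increased; failing the build.");
            System.exit(1);                                             // makes the ratchet enforceable in CI
        }
        // Ratchet down: record the new, lower (or equal) count as the baseline.
        Files.writeString(baselineFile, Long.toString(warnings));
    }
}
```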

Giovanni Asproni 00:13:35 Yeah. And so have you seen anything at the moment — any teams, any real projects where people are doing this right now? Have you experienced anything like that?

Brian Demers 00:13:44 I think especially in the case of flaky tests — so going back to testing, I think that's a huge problem. So it's a similar issue. It's a giant time suck for some teams. I've worked at shops where the flaky test notification was an email that went out to all of engineering. Well, that's fine if there are like 10 of you; it's probably less fine if there are hundreds of you, or thousands of you — that system does not scale. So the ability to collect that information and act on it is critical. So in the flaky test case you need to do whatever your company wants to do, whether it's move those tests into quarantine, disable the tests, assign somebody to fix the test, whatever it is. Rerun the test, potentially — if it's flaky, maybe just rerun it a bunch of times and it will pass. That's an okay stopgap. But you need to have a way to both work around these issues and to continue on to improve.
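
To show one way to surface flaky tests from data you already collect, here is a sketch that reads a hypothetical CSV of test results gathered across CI runs (commit, test name, PASS/FAIL) and flags any test that both passed and failed on the same commit — the classic flaky signature. The file name and row format are assumptions, not the output of any particular test framework.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class FlakyTestFinder {
    public static void main(String[] args) throws IOException {
        // Expected rows: commitSha,testName,PASS|FAIL (a format we assume the CI pipeline writes out).
        Path results = Path.of(args.length > 0 ? args[0] : "test-results.csv");

        // Map of "commit|test" -> set of distinct outcomes seen for that pair.
        Map<String, Set<String>> outcomes = new HashMap<>();
        for (String line : Files.readAllLines(results)) {
            String[] parts = line.split(",");
            if (parts.length < 3) continue;                       // skip malformed rows
            String key = parts[0] + "|" + parts[1];
            outcomes.computeIfAbsent(key, k -> new HashSet<>()).add(parts[2].trim());
        }

        // A test is flagged as flaky if it both passed and failed on the same commit.
        outcomes.forEach((key, seen) -> {
            if (seen.contains("PASS") && seen.contains("FAIL")) {
                String[] parts = key.split("\\|");
                System.out.printf("Flaky: %s (commit %s)%n", parts[1], parts[0]);
            }
        });
    }
}
```

A report like this is what feeds the quarantine/disable/assign decisions mentioned above, instead of an all-of-engineering email.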

Giovanni Asproni 00:14:35 Okay. Now also, as you probably know, there is a lot of talking about the use of AI in software development. It seems that, I mean, it is the most discussed subject among developers at the moment, with all sorts of opinions around it. But something that I think we can say is that the use of AI during the day-to-day work actually introduces some interesting challenges in terms of observability in the tool chain. Because now there is AI-generated code. So perhaps we need to know the provenance, maybe for security reasons or other reasons, and maybe copyright, who knows? But anyway, in general, what do you think are the issues introduced by this new tooling in this respect?

Brian Demers 00:15:17 Yeah. I knew we were going to talk about AI. Every conversation now has AI in it. And like you said, love it or hate it, I think it's here, it's not going away. We have to get used to it. So we have to be aware of these types of challenges. The provenance one is a great one. So where does this code come from? So there's a whole bunch of tooling around provenance and attestations that we could get into. But that's one way to record where information came from. But just the sheer fact that more code is going to enter the system means you have to have smarter tools around it. So let's say you're using, I don't know, Copilot. So that's just helping you write code. So you're the one writing code, Copilot's helping you. We have no information on where that code came from.

Brian Demers 00:16:00 But is Copilot helping your organization? I don’t know. You can ask developers, they’re like, yeah, I like it. But is it actually helping you deliver features faster? You have to measure something. Maybe measure the number of features you push to production or how often you have pull requests, whatever it is, that’s some measure. So you need some tooling around that to measure the productivity of that tool, for example. But what else? I mentioned that more code is coming into the system. So that means we have more code to compile, going back to warnings. Maybe there’s more warnings. Testing becomes more important. And then maybe even policies like, so if I’m generating code and generating my tests, what have I done? I haven’t actually done anything. Like I’m just running some generated code and generated tests. Those generated tests aren’t doing anything for me because they’re generated from the same source that ran the tests. On one hand, it is good because it’s going to detect if it changes the behavior in the future, but I don’t know. Anyway, I’m going to get into some philosophical issues here in a minute. So I think AI adds a tremendous amount of complexity into our systems just from the sheer volume it’s going to add.

Giovanni Asproni 00:17:08 So how can observability help us dealing with this complexity? Or at least understand what is going on and decide if it is a good thing or a bad thing depending on the situation?

Brian Demers 00:17:20 Yeah. So again, I think it comes back to being able to track data. So if your code base is growing year over year, your build times are going to get longer. That's just how it works. You have more code, it takes longer to do whatever it is you're going to test. So you have to have some sort of data around that to really look at and figure out: is this how we want to structure our code now? If our code base is growing, I don't know, a million lines a year, whatever, 10,000 lines a year, whatever that number is, is that sustainable? Do I need to break up my project into multiple projects? Spread them across repositories, spread them across teams, whatever it is — you still need data to make smart decisions based on that. And maybe it's not an issue. So that's another good thing that data can provide you. So maybe you are adding time, but you're using other techniques. There are a bunch of build acceleration techniques out there — build caching, compile avoidance, predictive test selection, all these other types of techniques — that can speed up your build. So maybe that increased amount of code that you're adding doesn't actually cause a regression in the developer experience.

Giovanni Asproni 00:18:24 Do you think that AI can actually help as well with this observability in the tool chain?

Brian Demers 00:18:31 Yeah, I think there are a number of ways. I think just the pattern matching alone from AI — does this failure look like this failure? Or, I should say, does one failure look like another failure, and can we categorize them? So in testing, for example, if you could lump test failures together, then a developer could probably make really informed decisions on which types of tests are more important to fix first. Now, if there's a flaky test, that's one thing. If there are test failures because I made a code change, that might be a different problem — but maybe it's an infrastructural problem. So a lot of builds are really complex, downloading things from various systems, but the internet's flaky. It's just how it is. There are always going to be connection errors, servers are down, whatever. So there's probably a lot of analytics that can happen — just analyzing, are these types of problems infrastructure problems? If so, that's not a developer's problem. Maybe we can automatically rerun a build for them. Maybe we can signal, hey, we think this is not your fault. Don't worry about spending the next hour of your life trying to figure out why it's your problem.

Giovanni Asproni 00:19:29 You are referring to things like downloading some dependencies with Maven or Gradle or something else that could be subject to flakiness if you have not cached them locally, for example. Things like this.

Brian Demers 00:19:40 Yes, absolutely. And how we do CI has sort of changed over the last few years — over the last 10, 20 years anyway. We used to have these long-running agents. Now everyone favors ephemeral builds. So now we're downloading more of the world on every build. And of course there are caching techniques around that. But again, observability data — knowing how long your build takes, knowing what portion of your build was downloading dependencies — might actually be useful. So if you have a large dependency download time on every build, maybe that's a signal that that build isn't configured to use a dependency cache or whatever. So I know GitHub and Jenkins have cache systems they can use to save your dependencies. There are other tools, like I said, build caches and things that work a little differently. But that's one indication of how having data helps you pinpoint those projects that potentially have problems.
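
One hedged way to get that signal, assuming you already record per-step timings (say, an export in the spirit of the CSV sketched earlier, extended with a step name), is to compute what fraction of total build time goes to dependency resolution. The file name, column layout, and the "resolve"/"download" naming convention are all assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.*;

public class DependencyTimeShare {
    public static void main(String[] args) throws IOException {
        // Expected rows: stepName,durationMillis (a hypothetical per-step timing export).
        Path timings = Path.of(args.length > 0 ? args[0] : "step-times.csv");

        long total = 0;
        long dependencyTime = 0;
        for (String line : Files.readAllLines(timings)) {
            String[] parts = line.split(",");
            if (parts.length < 2) continue;
            long millis = Long.parseLong(parts[1].trim());
            total += millis;
            String step = parts[0].toLowerCase();
            // Treat anything that looks like dependency resolution or download as "dependency time".
            if (step.contains("resolve") || step.contains("download")) {
                dependencyTime += millis;
            }
        }

        double share = total == 0 ? 0 : 100.0 * dependencyTime / total;
        System.out.printf("Dependency resolution: %d ms of %d ms (%.1f%%)%n",
                dependencyTime, total, share);
        // A consistently high share on ephemeral agents hints that a dependency cache is missing.
    }
}
```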

Giovanni Asproni 00:20:32 Okay. And have you seen any teams doing that, implementing some observability for the AI they’re using? For example, have you seen anything?

Brian Demers 00:20:41 Yes, so we have a lot of customers that are doing exactly that. Dependency download time is a huge problem for a lot of our customers for exactly what I just mentioned. They were on Jenkins two years ago, where they had these really long-running agents, and they switched to ephemeral builds. And look, now their builds are so much longer. What happened? Why isn't the new system working better than the old system? Well, it ends up being because your build actually changed. Now you're downloading a bunch of things. Or we had one case where the network file storage that the build agent was using was slower than the previous system. So how do you track that? The disk I/O is essentially slower. So without monitoring the disk I/O, there's no way a developer's going to know what happened. The first thing that's going to happen is, well, I like the old system, the new one sucks. Right? Well, that's not the true story. The new one isn't configured the same way. So maybe we should fix that.

Giovanni Asproni 00:21:34 Okay. Something, again, related maybe to AI. We mentioned code provenance, but now with all these developers using AI — well, according to some surveys, it's like 90% of developers in the world using AI, but I'm not sure how much we can trust those numbers, because usually they come from companies with an interest in the outcome. But there are still lots of developers using AI systems. So how critical is code provenance tracking in these tool chains for you, because of this usage?

Brian Demers 00:22:05 I think it's very critical. I think we're starting to see more interest in the open-source community. I think that's kind of where it's going to start, because companies consume open source — I don't know what percentage it is, I think Tidelift put out a report a few years ago, but some insanely large percentage of your applications is open source. So I think open source is also the area that moves faster, adopts new functionality faster. So what we're seeing is, I think, open-source foundations are going to be more interested in provenance of source. So I know, for example, when you bring a new project to Apache, there's a big audit as far as: do we have the capability of taking on the ownership of this source? So that's: I have some source, I'm bringing it to one place. But as we're incrementally adding source, nobody's really doing anything.

Brian Demers 00:22:52 There is a lot of snippet detection in a lot of these tools out there, but I think it's questionable at best. So yeah, so anyway, I think we probably should be doing more for tracking provenance of code from individual developers, from AI, from systems, wherever it is. Personally, I don't know of tools that are doing that now, but I do know we're starting to get into a world where we're tracking the provenance of where things were created, where things were built. I don't think we're going to get to the individual lines of code until we've solved the output problem.

Giovanni Asproni 00:23:24 Well, I guess at some point we'll need to get there, because have you seen what is happening also with some open-source software where malicious actors hack them? Like the SSH issue with the added vulnerability? So probably some observability…

Brian Demers 00:23:37 Yeah, the XZ Utils one, yes. I think those types of problems are going to be bigger to deal with. So there's some social engineering aspect to that, but just the increased amount of code that's coming in from AI, good or bad, means the volume of change is happening quicker. So maybe your open-source project or your corporate project is getting, I don't know, 10 changes a week — just some random number, 10 changes a week. That's a very reasonable number where somebody, one person, could sit down and review those code changes; you can scale that by whatever you want. But as soon as you 10X that number, that problem gets much harder. And doing thorough reviews of your code is harder. So we're going to need better tooling around quality gates. All of these things need to be automated. The human eyeball is probably less a factor here, and more a factor of, is this code readable? Not so much, is it correct?

Giovanni Asproni 00:24:30 And now let’s look into maybe how to improve the observability in our tool chain. So the first question I have is obviously related to the business. Yeah. We say we want to have better observability in our tool chain, and somebody from the business side can say, why? This is going to cost me money, what am I going to get out of it?

Brian Demers 00:24:50 Yeah. So my response would be, why do we have observability in our production applications? I mean, the answer from the business side is probably some sort of risk answer. Something goes down; capacity planning — we do capacity planning so things don't go down. It all kind of comes back to risk. Something goes down: how long does it take to bring it back up? So my troubleshooting window is smaller. So the same thing happens for our build tools. If something happens, we want the resolution to that issue to be shorter; or again, we want our developer feedback cycles to be shorter so we can be more productive, and essentially being more productive means faster time to market — I can increase the number of features that I'm working on or my team is working on. That's the business reason, I think.

Giovanni Asproni 00:25:36 Have you got any examples from real projects about this as well? Because at Gradle you just mentioned customers with the length of the build and this kind of stuff.

Brian Demers 00:25:45 Build time has been our bread and butter, I think. That's where our build acceleration technologies have really been big for us. But the only reason why they work is because of the observability data. So you want to find a bottleneck in your build? Well, you need to monitor and have detailed information about your build. But let's just say we've had customers massively overprovision build agents, for example. So I mentioned issues with disk I/O. But let's just say you're paying for Cloud compute and your builds are only using, I don't know, 10, 25% of your CPU. So you could probably save a ton of money by using smaller build agents, assuming they're in the Cloud. If they're in somebody's closet somewhere in your basement, then that compute is already paid for. But in the Cloud, you're paying per minute or whatever.

Brian Demers 00:26:31 So there are some serious gains to be had there. But usually what happens is nobody monitors that. So my build is what it is; I'm going to throw more hardware at it to make it go faster. But, I don't know, it's not I/O bound, or it's not CPU bound — or maybe it is, but it's limited to one thread. So I have four CPUs on my system, and I can only really make use of one of them. So I'm just throwing dollars out the window. So we've seen a handful of cases like that. And the opposite is true. So we've seen cases where customers are overutilized, so their agents are not provisioned big enough. A few years ago, when the Macs switched to Apple silicon, at Gradle in-house we did some performance benchmarks for ourselves — using our own product, of course. But we found out that our builds for the Gradle build tool and Develocity were faster on the Apple silicon Macs than the Intel Macs. And it was such a difference that it made sense to buy developers new laptops. Which makes everybody happy. The company's actually saving money, the developers get a new flashy laptop. So it's a win-win all around. But again, you can only do these things with data.
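
As an illustration of the over-provisioning point, here is a sketch that samples whole-machine CPU load while a build runs, using the JDK's com.sun.management.OperatingSystemMXBean (the getCpuLoad() call needs JDK 14 or later). The build command and the one-second sampling interval are placeholders.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class BuildCpuSampler {
    public static void main(String[] args) throws Exception {
        // Placeholder build command; substitute whatever your agents actually run.
        List<String> command = args.length > 0 ? List.of(args) : List.of("./gradlew", "build");
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        Process build = new ProcessBuilder(command).inheritIO().start();

        double sum = 0;
        int samples = 0;
        while (build.isAlive()) {
            double load = os.getCpuLoad();          // whole-machine CPU load, 0.0..1.0 (JDK 14+)
            if (load >= 0) {                        // negative means "not available yet"
                sum += load;
                samples++;
            }
            Thread.sleep(1_000);                    // sample once a second; interval is arbitrary
        }

        double average = samples == 0 ? 0 : 100.0 * sum / samples;
        System.out.printf("Build exited %d; average CPU utilization ~%.0f%% over %d samples%n",
                build.exitValue(), average, samples);
        // Consistently low numbers suggest the agent is over-provisioned (or the build is single-threaded).
    }
}
```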

Giovanni Asproni 00:27:39 With data. Yeah. In your experience, what are the considerations a team should actually make to decide between adding more observability to the tool chain or simplifying the system they've created? Because I guess the more complex the system, the more observability it needs, or the more complicated the tool chain may be. Have you come across situations where you'd say, well, have you considered making a simpler system that would give you a better bang for the buck?

Brian Demers 00:28:09 I haven't, but that's a great point. I know I talk about this a lot when I talk about choosing build tools. So for folks not in the Java world, there are two big Java build tools, Maven and Gradle, and I'm a big fan of the Maven tool. And then of course I work at the other company, which is a very interesting mix. So there's a lot of discussion around simplicity and speed and which one's better, which one makes you more productive, which is great and interesting. But I do think that's an important thing to consider. I think our build systems are complex because I can write code to do whatever I want. It doesn't mean I always should do that. So I think simplifying your system is in your best interest. But again, how do you know your change made an impact? If you don't have any data to start with and you make some change, you're like, it's better now. Well, better than what?

Giovanni Asproni 00:28:58 So you should start with some observability anyway. This is what you’re saying?

Brian Demers 00:29:01 Something. So if you’re going to make a change, even if it’s just doing some basic back of the envelope benchmarks and then making a change and then doing them again, I still think that’s a great win.

Giovanni Asproni 00:29:13 I also was thinking — have you seen that many teams are now moving back from microservice-based architectures to monolithic ones, or sometimes maybe still service-based, but instead of having a gazillion of them having three, kind of reducing the number? And that of course reduces the pipelines, the build systems, and everything else. So I think in those terms as well there is a reduction of complexity — not necessarily compile time, depending on how the systems are written.

Brian Demers 00:29:43 Right. So I know of a couple of customers that have more than 10,000 different repositories. So how do you manage that without any kind of tooling? How do you manage that? Take the simple case of upgrading a component — whatever build system you're into, you're all going to have the same problems. You either want to update a dependency, a runtime library, whatever. So if I want to update from some old version of Java to a new version of Java and I have 10,000 repos — I don't know, let's say 60% of them are Java, 40% are Node or whatever — first I need to know which of those repos affect me, which ones are using the older version, which ones are not using the newer version, which ones are using a version somewhere in the middle that maybe doesn't need attention now, but will the next time I want to make this change.

Brian Demers 00:30:26 But you can't manage that amount of code in that many different places without some sort of tooling. Again, maybe in some cases it's some basic code scanning and some inventory app that you run in-house. And that's fine. But obviously I think tracking these things as you build — each project gives you better, more detailed information. You'll know how long it takes to build, where the dependencies come from, which versions of all your tools you're using. It'll give you much richer information with which you can improve the system.
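
A hedged sketch of that kind of inventory step: walk a directory of checked-out repositories, look for a Java version declaration in each pom.xml (the maven.compiler.release property is one common convention), and report which repos still need attention. The directory layout, the target version, and scanning only that single property are simplifying assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.regex.*;
import java.util.stream.Stream;

public class JavaVersionInventory {
    // Matches e.g. <maven.compiler.release>11</maven.compiler.release>
    private static final Pattern RELEASE =
            Pattern.compile("<maven\\.compiler\\.release>(\\d+)</maven\\.compiler\\.release>");

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : "repos");   // directory of cloned repositories
        int target = 21;                                            // the version you are migrating to

        try (Stream<Path> poms = Files.walk(root)) {
            poms.filter(p -> p.getFileName().toString().equals("pom.xml"))
                .forEach(pom -> {
                    try {
                        Matcher m = RELEASE.matcher(Files.readString(pom));
                        String version = m.find() ? m.group(1) : "unknown";
                        boolean needsWork = version.equals("unknown")
                                || Integer.parseInt(version) < target;
                        if (needsWork) {
                            System.out.printf("%-10s %s%n", version, pom);   // version found, then the pom path
                        }
                    } catch (IOException e) {
                        System.err.println("Could not read " + pom + ": " + e.getMessage());
                    }
                });
        }
    }
}
```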

Giovanni Asproni 00:31:00 Yeah. Now let's say the company decides — somebody decides — yeah, we need more observability here. What are the organizational and cultural changes necessary to implement that? I'll tell you what I'm thinking about. You see, when I work with companies, very often I find that when there are complex systems, maybe with many teams, there are situations where nobody looks at the build and the broken tests — something breaks, it's not my problem, it's somebody else's problem. So in those situations, you can have all the observability you want, but if nobody looks at the system, at the data and the information, nothing will happen. So in your experience, what kind of changes have to happen in the organization for this implementation of more observability to succeed?

Brian Demers 00:31:43 I think there's a little bit of education that needs to be involved. Some developers are going to get it: if I'm troubleshooting an issue, the more information I have, that should, in theory, make it easier for me to troubleshoot the issue. We've seen cases where the kind of enforcement of observability comes from your CI team. So there are a lot of bigger shops where, if there's a CI issue, what happens is the developer's like, oh, it's not my problem, I'm going to open a ticket and this central CI build tool team will look at it. And they don't know what to do unless it's some sort of centralized, common tooling problem; otherwise there's nothing they can do about it. So what they implemented is — we have a product called Build Scans, which is essentially a detailed view of all of the information that happened in your build — they've said you can't open a ticket with the central team without this observability data. But the fact that you have that observability data means that someone can probably look at it before sending it to the CI team. So that was one way that that other team kind of enforced that the data exists for them, essentially. But also, because it exists, it makes it easier for the developer to kind of self-select: should I fix this, or is this somebody else's problem?

Giovanni Asproni 00:32:56 Okay. And in terms of data — so we have many tools, and we've said we want observability. How can we avoid being submerged by a deluge of data? So how do we select what to observe?

Brian Demers 00:33:08 Yeah, that's a great one. Part of me wants to say observe everything, and then you can use that data as needed. But there are obviously cost considerations, timing considerations. So I would say start with the problems that are bothering you the most. So again, we started talking about developer experience. If feedback cycles are the thing you want to improve, then obviously the timing of things is at the forefront. So measure CI times, measure local developer time — so, builds on local developer machines. You can measure how long it takes for something to fail. So obviously if your build's an hour and most of your failures happen in the first 30 seconds, that's really good. That's probably an ideal case. But that's kind of where you'd want to be. So measure the basics first. And then you can do the simple things like: how often are your developers running builds?

Brian Demers 00:33:57 So if your build is an hour, you can only do that a couple of times a day. So if you can shave, I don't know, 20 minutes off it, maybe you can do it three, four times a day. So that'll improve productivity, it'll improve throughput, it'll improve your CI — your CI queue times should go down because your builds are smaller. There's a lot of information just in that basic measurement of time that will help you improve. And of course, if you want to get into the security angle, like really monitoring your dependencies and those things, that's obviously a bit more involved. But I think starting with some of the basics to just get an understanding of where you're at is a win. But it's up to the company: find your problem and focus on measuring that. And then what's going to happen is you're going to have the aha moment — or, that wasn't actually the problem, it's this. But because you're starting from somewhere, it's a lot easier to incrementally improve your position than to go from zero to 60.

Giovanni Asproni 00:34:51 Okay, let's say now the company is sold on this, they want to do it, and they have a greenfield project to start with. How can they start? So in a greenfield project, how would you approach adding observability to the tool chain?

Brian Demers 00:35:05 That's a good one. I think that might be harder because you don't have anything to compare it against. But it's also good because there are no preconceived notions of the types of data that you need, the problems that you're currently having. So again, I would start small. So instrument everything — whatever tooling you're using that supports any kind of instrumentation or data collection, I would do that. If you're a shop that uses some sort of logging service, you can even dump your builds into a logging system; that's still better than nothing. That probably gets expensive really quickly. But I would say start collecting data. I think you need to also make some sort of decisions on that data. So not just collect it, but also maybe do some sort of alerting. So again, I keep beating the build time metric, but it's an easy one — or flaky tests.

Brian Demers 00:35:55 Let's go back to flaky tests. So, percentage of flaky tests: if that rises above, say, 2%, you need to alert somebody. So there needs to be some actionable event that happens. And if you're starting with a greenfield application, maybe that's a really low threshold. And maybe it's always going to stay that way, or there's always some inherent flakiness with systems now. But start with a low number and be able to do some alerting — whatever it is: create a Slack bot, send an email, open an issue automatically, assign the issue to somebody to fix right away. That way the data that you're collecting is actionable and it's being used.
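
To make the "2% threshold" idea concrete, here is a sketch that computes the flaky percentage from two counts and posts a message to a chat webhook when the threshold is crossed. The webhook URL, the JSON shape, and the way the counts are obtained are all placeholders, not any specific vendor's API.

```java
import java.net.URI;
import java.net.http.*;

public class FlakyRateAlert {
    public static void main(String[] args) throws Exception {
        // In a real pipeline these counts would come from your test-result store, not the command line.
        int flakyTests = Integer.parseInt(args[0]);
        int totalTests = Integer.parseInt(args[1]);
        double threshold = 2.0;                                   // percent, as discussed above

        double rate = totalTests == 0 ? 0 : 100.0 * flakyTests / totalTests;
        System.out.printf("Flaky rate: %.2f%% (%d of %d)%n", rate, flakyTests, totalTests);

        if (rate > threshold) {
            // Placeholder webhook; most chat tools accept a simple JSON payload along these lines.
            String payload = String.format("{\"text\":\"Flaky test rate is %.2f%%, above %.1f%%\"}",
                    rate, threshold);
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://chat.example.com/webhook"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Alert sent, HTTP " + response.statusCode());
        }
    }
}
```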

Giovanni Asproni 00:36:32 And what about this: even if there isn't a lot of data, in addition to what you say, a reason to do something from the beginning would be to set up the necessary infrastructure to actually be able to collect the data in the first place. That is what you said, pretty much: for every tool that gives you the possibility to add some instrumentation and collect data, do that. So I would imagine that even if you don't use the data at the beginning, just having the infrastructure to collect it and put it somewhere could be useful.

Brian Demers 00:37:05 I agree. The amount of learning that your team will gain — whether it's the development team working on the greenfield application or some centralized team that's supporting them with the observability infrastructure — it's a very low-risk way to gain organizational knowledge around that and then start figuring out the questions you want to answer. And then you can scale from that greenfield project to other projects.

Giovanni Asproni 00:37:28 And then the other question is, of course, for legacy systems — which, from your previous answers, I would imagine is the easier part, because you have data now, or you know what to look for.

Brian Demers 00:37:39 I think so — at least you have some questions that you want answered. So usually it's "is this thing slow?" — I don't know, that's a big one. So if you have, again, a million lines of code, is this thing slow? The obvious answer is yes. Why is it slow? Because it's a million lines of code. But that's not really an actionable answer. You can't say, well, I'll just delete half the code — you can't do that. But there are things you could do. What percentage of the time is spent running tests? It's probably a large portion, but maybe linting is actually five minutes of that. So I love linting, I love all these quality static analysis tools, but a lot of them can be hooked up in your IDE. And so the incremental cost of running them is almost nothing.

Brian Demers 00:38:21 I type a line and instantly my IDE gives me a red squiggly line or some warning or whatever. So if that's the case, if I'm already getting that feedback, then I probably don't need it when I run local builds. I probably do need it in CI to make sure those things don't actually happen, and people aren't ignoring IDE warnings. But that could be five minutes cut off your build. Now, five minutes out of an hour, maybe that doesn't move the needle for you. But that is something. And there are probably a lot of those types of things that can be improved over time.

Giovanni Asproni 00:38:51 But I guess that on the other hand, in legacy systems, depending on what you're doing, it might be difficult to instrument the code and act on it as well. Because I'm thinking about, for example, security issues. It's like if you start looking for them and you've never done that before, you may end up in a situation where you have a pile of them and you cannot really stop building your system until all of them are solved. So you need to have a prioritization of sorts.

Brian Demers 00:39:18 Absolutely. And there’s a handful of tools that will help with that for like security scanning specifically. But I think that’s another one of those types of issues where you incrementally improve. So if you have a thousand security issues today, you want to make sure that number doesn’t increase. And then hopefully you target that number, and you can slowly drop it or quickly drop it would be better. But a lot of build systems are older systems, legacy systems are overly complex. They’ve migrated between versions of tools. They’re in states of migration. I know one system that I worked on a few years ago that was migrated from ANT, so in the Java world, ANT is this pretty legacy build tool. What happens with a lot of these migrations is it was migrated enough to work with a new system and then there’s still a lot of tech debt. Or build rot or tech debt that’s in the system that was just never cleaned up. But it actually affects the performance of the system, how easy it is to add new tools. Again, if you wanted to add a new security scanner, even that becomes harder because there’s always these special little edge cases around the tools. So going back to the simplification of our systems there’s a lot to be said for that, but just knowing where the complexity is, again, is still valuable information.

Giovanni Asproni 00:40:32 And for teams that are new to tool chain observability, have you got any suggestions about metrics? First-step metrics — we want to have some observability; what do we start to look for?

Brian Demers 00:40:46 I think build time is the easy one. I think that's the one that is the easiest to comprehend for folks. And I know I keep saying it, but it's like test coverage. So there are a lot of people who hate the test coverage metric because it can be abused or gamed, but I find it still a very valuable metric — you just take it with a grain of salt. So again, take build time with a grain of salt, for example. But it's an easy one to measure and to improve and to know if there's a problem. So, there's a lot of science that talks about context switching — how long it takes your brain to context switch. Basically, if your feedback cycle is over some number of minutes, it actually does you a disservice. So I think the science says it takes 10 minutes to context switch away from something to be productive again, and then 10 minutes to switch back.

Brian Demers 00:41:34 So that’s basically 10 minutes of time. So if my build is say, 20 minutes. That’s probably the worst-case scenario. So I start to build and what’s going to happen is I know it’s going to take 20 minutes, so I’m going to go check my email. And then it’s going to take me 10 minutes to kind of focus on my email. So that’s an extra 10 minutes I wasted. And then eventually I’m going to come back to my build and it failed. It failed whatever, six minutes into the build and I’ve been sitting on my hands scrolling through my email for the last half hour. So I go back, I was like, oh it’s something silly like I missed a semicolon or some silly change. I fix it and then I start the whole process over again.

Brian Demers 00:42:11 So again, it took me a while to refocus my efforts. Anyway, those are all really bad behaviors, but needing to kind of think about that a little bit as we’re thinking about improving feedback cycles, I do think is important. So I think without any kind of measurements, you don’t even know if you’re at risk for that. So if all your feedback cycles are 30 seconds. Yeah, you can make it better, but that’s really good. If your feedback times are an hour, that’s a huge problem. So I keep saying the same metric, but I think that’s the big one.

Giovanni Asproni 00:42:43 In terms of implementing observability in the tool chain from an organizational perspective, what are the most common issues organizations encounter when trying to do that?

Brian Demers 00:42:53 I think it's somebody owning the tools that are around it. So if you're more than one team or more than one product organization — if you're a single-product company, it's pretty easy to get things moving because everything revolves around one code base or one product, so all of the teams are very, very connected. But if you have kind of a centralized team that manages, I don't know, 10, a hundred, a thousand different projects, those tools are usually owned by somebody else — whatever tool it is, our tool, somebody else's tool, log collection, whatever it is. So just the amount of effort it takes is challenging for somebody. So a developer comes in and says, hey, this build is slow, we should fix it. And everyone's like, yes, we should. But they're like, I don't really want to request new infrastructure or deal with the other team.

Brian Demers 00:43:38 They don't understand my problem. And so I feel like there are probably some social issues here. And this isn't just for observability — this is kind of any tool that will improve your life. I know people who have a hard time getting an IDE license for the same reason, which is mind-boggling. You want to talk about a productivity tool — your IDE is kind of where it starts. So I feel like there's some of that we need to get over as an industry. But also the knowledge I don't think is there. So like I said, I mentioned before, you wouldn't put an application in production without monitoring. Everybody just kind of assumes that. That's table stakes. It's not shocking to anybody. If I came to you tomorrow and said, hey, I've got this tool or this application we're going to deploy, my organization probably has documents laid out for how it's going to be monitored. We have playbooks, whatever you want to call it — that's already all resolved. But we don't have any of that for build systems. So there's a little bit of an organizational learning curve to get there.

Giovanni Asproni 00:44:38 So how do you see the future of observability in the tool chain evolving?

Brian Demers 00:44:43 That's a great one. I think the importance of it is going to increase. I think from the security perspective it's going to have a huge impact, because the rise of Sigstore, SLSA, and in-toto — all these other things are coming from different use cases of getting data out of the system. So basically, for folks not aware, you essentially have inputs and outputs into a system. And you want to know what those inputs are, and you want to know what the outputs are, and you want to make sure there's security in place. Like, did this Git commit actually come from, I don't know, GitHub? Was this thing that was built — was my binary built on a machine that was approved and met some sort of criteria? Great. We have a structure for defining that information, signing it, making it secure and available.

Brian Demers 00:45:35 But that’s very similar data. If you think about it, your build tools want to know the same types of things. What went into the system, what comes out, those are the things you want to track. You could essentially define policies around all of that. So if you want to define a policy for your deployments, so your production applications need to say, we’re only going to deploy things that were built on these approved systems. Well then you need to have infrastructure in place and policies and tooling to know that all of that stuff actually happened. And you need to have some alerting that when those policies are not met, then your CI systems fail or whatever it is to give feedback faster to developers. So I feel like there’s going to be pressure from that angle too, that are coming into the system as well as obviously productivity is a big one.

Giovanni Asproni 00:46:22 Is there anything that we missed that you would like to add?

Brian Demers 00:46:25 I don’t think so. I think the big thing for me again is we really need to start thinking about our tool chains as production systems. They’re the production infrastructure for getting things to production. So when we start thinking about things that way, I feel like that removes a lot of the friction that we have or how, or at least frames how we think about the problem to begin with.

Giovanni Asproni 00:46:46 That’s a very good point. I agree with that. Okay, that was the last question. Thank you, Brian, for coming to the show. It’s been a real pleasure. This is Giovanni Asproni for Software Engineering Radio. Thank you for listening.

Brian Demers 00:46:57 Thanks for having me.

Giovanni Asproni 00:47:06 This is Giovanni Asproni for Software Engineering Radio. Thank you for listening. [End of Audio]
