SE Radio 723: Dave Airlie on Linux Kernel Maintenance

Dave Airlie, a Distinguished Engineer at Red Hat, speaks with host Gregory M. Kapfhammer about Linux kernel maintenance. After an overview of the scale and structure of the Linux kernel, they dive deep into the review and validation of kernel patches, drawing on examples from the GPU subsystem. After discussing the features and benefits of the Linux kernel’s maintenance model, they also explore kernel maintenance best practices and the supporting tools for these practices. Dave and Gregory also discuss topics such as the integration of Rust code in the Linux kernel and the ways in which AI-driven code review is influencing kernel maintenance.

Brought to you by IEEE Computer Society and IEEE Software magazine.

Show Notes

Related Episodes

Other References

Dave Airlie Linux Graphics blog
Dave Airlie’s Profile at NVIDIA Developer Blog: Author: David Airlie | NVIDIA Technical Blog
Dave Airlie’s blog on vendor code sharing: Linux graphics, why sharing code with Windows isn’t always a win.
Why Github can’t host the Linux Kernel Community
Rust moves from experiment to a core Linux kernel language – Spiceworks
Linux Kernel Maintainer Handbooks: Subsystem and maintainer tree specific development process notes — The Linux Kernel documentation
Linux Kernel Development Process: How the development process works — The Linux Kernel documentation
Linux Stable Kernel Rules: Everything you ever wanted to know about Linux -stable releases — The Linux Kernel documentation
Kernel articles: Kernel coverage at LWN.net [LWN.net]
Linux Kernel Mailing List (LKML): LKML.ORG – the Linux Kernel Mailing List Archive
Linux_Kernel_Newbies – Linux Kernel Newbies

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Gregory Kapfhammer 00:00:18 Welcome to Software Engineering Radio. I’m your host, Gregory Kapfhammer. Today’s guest is Dave Airlie. Dave is a longtime Linux kernel maintainer and a distinguished engineer at Red Hat. Dave, welcome to Software Engineering Radio.

Dave Airlie 00:00:35 Hi Greg, thanks for having me on.

Gregory Kapfhammer 00:00:37 I’m delighted to talk to you today about Linux kernel maintenance. Dave, you’re the Direct Rendering Manager subsystem maintainer in the Linux kernel, and you have nearly two decades of kernel maintenance experience, and I’m super happy to learn from you throughout the episode. Are you ready to dive in?

Dave Airlie 00:00:55 Yes, let’s go.

Gregory Kapfhammer 00:00:56 Okay. So, at the very start, I mentioned a moment ago that you’re the maintainer of something that’s called DRM. Can you just tell us quickly what is DRM so we know the kind of work you do on the Linux kernel?

Dave Airlie 00:01:09 Yeah, well since DRM is a bit of an overloaded term, it’s not the bad one, I suppose is the best way to describe it. It’s, I think, called the Direct Rendering Manager, which is a legacy name for what was just the GPU or graphics support subsystem in the kernel. The name was given to a small component maybe 20 years ago and we have expanded on it, but we just kept the name. So, I think the name is DRM. I don’t think Direct Rendering Manager even really makes sense for what it does anymore, so stick with the acronym, but pretty much, when you hear DRM, just think GPU, graphics, accelerators.

Gregory Kapfhammer 00:01:40 Okay. So, we’re going to use DRM and your experience with maintaining it throughout the episode. But to get us started, we want to talk a little bit about some of the scale and structure issues associated with the Linux kernel and we’re going to talk about subsystem workflows and how you do release engineering, and we’ll use DRM as concrete examples throughout the episode. So, with that in mind, can you tell us a little bit about the scope and the architecture of the Linux kernel? How many maintainers are there? How many subsystems are there? Please give us a few initial insights.

Dave Airlie 00:02:15 Yeah, the Linux kernel is, well, we consider it’s probably the biggest software engineering project in the world, at least that’s what we like to say. In terms of maintainers, the number varies. It’s changeable; I would say on average it’s between 80, 100, 150. The varying levels of what a maintainer is and where they sit in the hierarchy move a lot. Every year we have a maintainers summit where Linus invites the top 30 or so maintainers. So that’s the number of top-level people that Linus needs to talk to. But that scales out then into a hierarchy that can expand to up to a hundred, I would say, in the kernel at the moment. In terms of subsystems, again, there’s no strict definition of what exactly a subsystem is. A subsystem is something that a maintainer maintains, usually. So, in that sense, the number of subsystems is about the number of maintainers; maybe it’s not one-to-one, but it’s pretty close. But the major subsystems would probably be the graphics subsystem, the networking subsystem, the CPU support like x86 and ARM64, and then storage, file systems, and memory management. So, they’re probably the biggest groupings that we have.

Gregory Kapfhammer 00:03:20 Okay, so we’ve got DRM as one subsystem and CPU scheduling or memory management or file systems, those are all examples of subsystems. So, if we take DRM as an example, how many people are working with you as the maintainer of the DRM subsystem?

Dave Airlie 00:03:37 If you’d asked me this yesterday, I would’ve said a hundred, but I ran the numbers last night and apparently, we have over 300 people contributing every kernel release to the DRM subsystem, which actually surprised me. So, they don’t all work directly with me. Again, we have our hierarchical system. I probably work on a weekly to fortnightly basis with maybe 15 to 20 people, and then they work in the hierarchy with the other groupings. But yes, I was surprised that we have up to 300 contributors per kernel for the DRM subsystem.

Gregory Kapfhammer 00:04:08 Okay. So, there’s 300 people and they’re somehow reporting to you and you’re managing their work and you’re reviewing their pull requests. Can you give us at a high level a rough idea, what are those 300 people doing? And then what are you doing when it comes to managing the infrastructure for DRM in the Linux kernel?

Dave Airlie 00:04:27 So this is probably an area where the DRM subsystem is a little bit different to a lot of the other subsystems in that I actually have quite a large hierarchy. I would think of it more like a corporate style: there are two co-maintainers, myself and Simona Vetter, and then below us we have a range of maybe six or seven more feeding into us, and then they have a range of people feeding into them, and then there’s all of the frontline developers probably feeding into that. So, depending on where you sit in the hierarchy, just like in a corporate structure, your job is very different. So, my role, a lot of it is, I suppose, what I would call facilitating and creating an environment for people to be able to work in and be happy to work in. My main weekly task is that I deal with Linus; I submit the pull requests to Linus on a weekly basis.

Dave Airlie 00:05:14 I gather all the pull requests from the people below me and I send them up to Linus. The amount of review I personally do on those pull requests is very limited at this point because they are so large already that I have to trust that the maintainers below me have done appropriate review, and they have to trust that the maintainers below them have done appropriate review. The review all happens in public on the mailing list, so we can always go back and reference it. But for me personally, on a daily or weekly basis, I don’t often review the pull requests that much. I do a set of standard builds on them; I make sure there’s no obvious case where they don’t build. That’s a big problem. But mostly the maintenance role for me is very different than for a standard maintainer. A lot of the standard maintainers will be applying the patches to their trees, checking the quality of the work and making sure that it’s suitable for going upstream to me.

Gregory Kapfhammer 00:06:02 Thanks, that’s really helpful. So, I have the understanding that people are in a way working for or reporting to you. And you mentioned a moment ago that you’re reporting to Linus Torvalds as well. Can you tell us a little bit more about how you report to Linus and what type of work you do on behalf of this core leader for the Linux kernel?

Dave Airlie 00:06:22 Yeah, so one thing with the kernel is Linus is very delegating. He delegates a lot of trust to the top-level maintainers, and it is kind of like, I hate to use the word fiefdom, but you own your garden: you are in charge of maintaining that section of the kernel and he trusts you until you give him a reason not to, I suppose. I’ve worked with him now for 20 years. I’ve had a few public fallings-out with him. They’re easy to find on the internet, but overall, we don’t have a huge amount of interaction. Most of my interactions with Linus are at the maintainer summit once a year. We’ll meet up and we’ll just talk about life, but mostly it’s about normal things. People like scuba diving, so there are things like that. But in terms of the kernel, every week I send him an official pull request with the list of all of the fixes for that week from all of the people in my group, and he then will process that. If he processes that, the system will send an automatic reply telling me that Linus has taken the request.

Dave Airlie 00:07:16 If you get a reply from Linus himself, that’s usually only if you’ve done something wrong. He is not really good at giving positive feedback. He is much better at giving something wrong feedback. He’s known for not giving the positive side. So, you generally look for the Linus on the reply and hope you don’t see it and then things go along swimmingly. Once every, like we’ll go into the release cycle later, but once every nine or 10 weeks you have to send the big pull request, which is all of the work that is due for the next kernel. That’s the one that often upsets him because that’s the one where you will have a lot of change and there’s always regression with change and you just have to hope that he doesn’t catch a really bad regression, or you haven’t messed up some other part of the kernel.

Dave Airlie 00:07:56 But in general, my interactions with Linus are purely on me. I send him a weekly email and I hope he doesn’t have a problem with it. If he has a problem with it, I delegate that down to the people who caused the problem. I’m an umbrella, I suppose, as well. I need to protect the people under me from his wrath. He’s calmed down over the years, and I’m used to it at this stage, so I sometimes have to take an hour or two to breathe and then I’m good.

Gregory Kapfhammer 00:08:19 Okay. So that’s really giving us a nice sense in the day of the life of a Linux kernel maintainer and we’re going to do a deep dive into all those topics in a moment. But before we do that, can you tell us a little bit about some of the specific challenges that you face when it comes to maintaining DRM in the Linux kernel?

Dave Airlie 00:08:37 I think the biggest challenge I’ve faced over the years, and Simona, my co-maintainer, has been great at helping with this, is that the scale of things when you get to being this big is quite hard to communicate to other maintainers, and it is hard to have interesting conversations with other maintainers who are dealing at the 10-people level or the 40-people level. They don’t quite understand why the scale changes things. When you get to our level, even when I was thinking we have a hundred people going through it, the scale problem is different, but now that I’m thinking 300, I’m like, okay, no, our scale problems are definitely valid and different than theirs. So, for example, we have to have a hierarchy. We have a common tree where everyone has commit access for a bunch of things, and we try to have a group maintainership model, which isn’t something that is standard in the kernel. The x86 subsystem kind of did it with three people.

Dave Airlie 00:09:26 We’ve done it now with maybe 30 to 40 people. I think in our group maintainership there are three maintainers, but there’s a larger group of committers. So that system is different than everyone else’s. So that’s probably one of the biggest challenges. I think scaling the standard kernel development processes up to the size of our subsystem has been the biggest challenge we’ve faced over the last years. In my world, a lot of my job has been to just sort of step away and not let my, I suppose, ego or my own niche workflows drive everyone else’s. I need to accept that, okay, this is something that makes the community better, and maybe it makes my life a bit harder, but I need to use these scripts or this development methodology because it makes it easier for everyone else. Often we have a problem where over the years we get our little niche of how we develop things, and we stay in it. Being able to not do that has been very good for me and for helping the subsystem to grow, I think, because scale is hard.

Gregory Kapfhammer 00:10:20 So you’ve talked about scale from the perspective of the number of humans and you mentioned there are like 300 people who are developing various types of patches or change sets. Can you tell our listeners a little bit about the size and the scope and the scale of your part of the Linux kernel when it comes to things like lines of code or size of a change set, other details that you can share there?

Dave Airlie 00:10:43 Yeah, the lines of code are a bit of an outlier for us because we have a huge amount of auto-generated headers, which both piss people off and are very valuable for us for a few drivers. But I think we’re talking at least a million lines of code under my purview in that subsystem. The headers probably add another million or half a million onto that. So yeah, you’re probably talking at least 1 million to 2 million lines of code. I don’t have accurate numbers on the number of change sets I process, but I would say it’s in the 2,000 to 3,000 range every kernel cycle. So, every three months I’d say we put in about 2,000 to 3,000 of those. I would say I probably process maybe 50 to 60 pull requests from sub-maintainers in that cycle as well. On average in a week, I’d say I process maybe five to six pull requests from sub-maintainers, and normally at the fixes stage of the kernel they probably only have 10 to 15 patches in them. But the big merge request is where the majority of things come through.

Gregory Kapfhammer 00:11:39 That makes a lot of sense. Now I know many people who use Linux, they have heard of this idea of an RC build, or a Release Candidate build. And so, in this next phase of our show, I’m hoping that we can give the listeners a sense of your workflow or the cadence or the processes that you follow in Linux kernel maintenance. So, can you tell us a little bit about how many weeks you work for until you get to a release candidate? Can you give us a few details about that process as we dive in?

Dave Airlie 00:12:09 Yeah. I’ll probably give the general overview of how the kernel works and then go a bit more into what we do. Overall, the kernel has what we call a nine-to-10-week release cycle. Depending on how happy Linus is, it can go nine weeks, or it can go 10 weeks if he feels there’s been a reason. Mostly we’ve been doing nine weeks lately. We had one 10-week cycle last time. The way that works is the first two weeks of every new kernel development cycle are called the merge window. That’s where Linus takes all of the pull requests from all of the maintainers under him and he puts them all together and stabilizes that for two weeks. And we hope in those two weeks that that’s a runnable kernel. At the end of those two weeks, he’ll release what’s called RC1, and then on a weekly basis, usually on a Sunday afternoon Pacific time, he will release RC2, RC3, up until RC7 usually, and then after RC7 he will release the final one the following week.

Dave Airlie 00:12:58 Occasionally there’ll be an RC8, and maybe there was an RC9 once. It goes through that cycle. Every week after the merge window is just fixes, and everything should be stable. There should be no new regressions. There should be no new code. Linus comes down very heavily if you try to sneak things in during those weeks. It’s very specific that the merge window is where you drop new stuff, and everything after that should just be fixes for it. Technically it should just be fixes for regressions, but we often add fixes for general things that we need to backport or that we need to stabilize. But in saying that, a lot of people have a misconception about the merge window. The merge window is for Linus to merge the trees that other people have prepared prior to the merge window. The merge window is not when you should be preparing your tree for Linus.

Dave Airlie 00:13:42 The merge window is not when you should be doing development that needs to go to Linus. All that stuff should generally have been ready for Linus prior to the merge window, and it will have been in a tree we call linux-next, where somebody gathers all those disparate trees across the internet and just does those merges every day. That’s their job, just to sit there merging every day and find all the conflicting problems, and make sure we understand that building the tree is hard, that all these merge conflicts will happen, and that we know about them in advance. So, when Linus gets to do all those final merges into history, he has a history of all of the conflicts, and he knows what’s going to cause problems. He will accept things that haven’t been in linux-next, but he will often be very unhappy about that or will sometimes just say no.

Dave Airlie 00:14:24 It’s generally preferred that we have stuff ready in advance. Again, how that happens is very subsystem dependent. I’ll go into how we do it in the graphics subsystem. We have a hierarchy of trees that come from vendors. So, we have an AMD GPU tree and a tree from Intel. We have a tree for Qualcomm, and then we also have a misc tree where all the smaller drivers cohabitate, and they all push their changes through that. We have a tree called a next tree that we have open all the time, and when Linus releases RC6, we will generally shut our next tree down. If your stuff isn’t in our next tree by RC6, which usually gives us a two-week window to stabilize, we will leave it for the next kernel cycle. You can push it into the next tree, but it won’t get merged till the next kernel cycle.

Dave Airlie 00:15:10 And that’s how we’ve done that over the years. We initially were a lot worse at that, but we’ve stabilized in the last four or five years onto that system and it seems to be working quite well for us. Pushing in stuff after RC6 is at my discretion usually. And occasionally I will do it if it’s something I’m either personally interested in or some piece of hardware that we need to get working, but we try to maintain that because we’ve found for stability that allowing a free-for-all after RC6 hasn’t led to quality results, put it that way. So yeah, generally that’s how we work: we set everything up in advance by RC6. Linus opens the merge window two weeks later, and I will send the big huge pull request of everything that was there at the previous RC6 to Linus. He will accept that, and he’ll pull everyone else’s in. When he hits RC1, we will pull that into our tree and then start generating the fixes trees based on that, and then bring all the work that happened between RC6 and that point into a next tree and go again for the next window.
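The cadence described above (a two-week merge window, weekly release candidates up to RC7 or occasionally RC8, a final release one week after the last RC, and a subsystem next tree that closes at RC6) can be laid out as a simple calendar. The following is a minimal sketch under those stated assumptions; the function name, parameters, and the start date are hypothetical and are not taken from any kernel tooling.

```python
from datetime import date, timedelta


def release_schedule(merge_window_opens, last_rc=7, next_cutoff_rc=6):
    """Return (label, date) milestones for one kernel cycle.

    Assumptions (from the conversation, not from official tooling):
    - the merge window lasts two weeks and ends with rc1
    - one release candidate per week after that
    - the final release lands one week after the last rc
    - the subsystem "next" tree stops taking work for the
      upcoming cycle at rc6
    """
    milestones = [("merge window opens", merge_window_opens)]
    rc1 = merge_window_opens + timedelta(weeks=2)
    for n in range(1, last_rc + 1):
        label = f"rc{n}"
        if n == next_cutoff_rc:
            label += " (subsystem next tree closes)"
        milestones.append((label, rc1 + timedelta(weeks=n - 1)))
    milestones.append(("final release", rc1 + timedelta(weeks=last_rc)))
    return milestones


if __name__ == "__main__":
    # Hypothetical start date, chosen only for illustration.
    for label, day in release_schedule(date(2024, 1, 7)):
        print(f"{day.isoformat()}  {label}")
```

With the default of seven release candidates the cycle spans nine weeks; passing `last_rc=8` gives the ten-week variant.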

Gregory Kapfhammer 00:16:02 Okay. So that’s really helpful and you’re giving us some insights into Release Candidates and how they’re actually built and the cycle that you go through. Just to make sure all of our listeners are on the same page; you mentioned the words regression and then also the word tree. Can you concretely define a regression and the tree and give us some other insights into how those work in kernel maintenance?

Dave Airlie 00:16:25 I’ll start with tree because it’s probably easier. Now it’s a Git tree. What we call the kernel tree is just the kernel checkout, but now with Git it’s the Git tree, the Git checkout, the Git repo. So, we have a number of Git repositories that we use. We have some for DRM, we’ve got multiple of them for DRM, and that’s what we build a hierarchy from. So, when I say tree, it’s just those Git trees that flow up to Linus through me. And for regression, well, I’d love to define it properly, but yeah, a regression is meant to be something that broke in RC1. If something is merged for RC1 that causes anything to be worse than it was before RC1, then it’s considered a regression. The standard method of dealing with regressions is to revert first, ask questions later.

Dave Airlie 00:17:10 So we should remove the patch that caused the problem and then work out what the problem was and fix it in the next cycle. That doesn’t always happen. Engineers are not brilliant at dealing with regressions in that way. Often their instinct is to try and fix it first, and that often causes a lot of discussion on what is a regression. Is this considered a regression? If it’s a regression from two kernels ago, is it something that should be fixed urgently now? And sometimes that should be more urgent. So, there’s a lot of scope for what a regression actually is, but Linus’s technical definition has always been something that broke in RC1 and made life worse for a user.

Gregory Kapfhammer 00:17:41 Okay, that’s really helpful. So, is a regression something that’s connected to performance or correctness? I’m guessing that it could be both of those is that right?

Dave Airlie 00:17:50 Yes, indeed. Anything, to be honest. His definition is if it made something worse for a user, whether that user is seeing a performance difference or that user is seeing a functional break in their hardware. But his thing is, if nobody notices, it’s like if a tree falls in the woods. If a regression happens and nobody notices, it’s not considered a regression till someone notices. We don’t urgently go looking for them; there generally has to be a user driving the call.

Gregory Kapfhammer 00:18:15 So in the context of the graphics subsystem and accelerators, can you give us a few concrete examples of what a regression would look like?

Dave Airlie 00:18:23 Quite often a regression for us actually looks like your laptop screen not turning on. That’s the easiest one to give: my laptop screen is now flickering or not doing what it used to do when I had the last kernel, and I got the new kernel. Often, some of the rendering on the screen might have an issue. Maybe my hotplug stopped working, or my suspend and resume stopped working. In the old days, the more common ones were, yes, suspend and resume stops working or your laptop just doesn’t light up anymore, which are obvious regressions. In other cases, more in the data center or accelerator space, it’s like, yeah, a workload has gotten slower. I have this standard workload I test with, I’ve upgraded to a new kernel, and suddenly it’s not working as well. Or say you have some device like a Steam Deck and you’re trying to move from an older kernel to a newer kernel and suddenly, with the newer kernel, the device isn’t cooling as well or it’s not scheduling as well, and you’re not seeing the same frame numbers you were.

Dave Airlie 00:19:14 Those are all considered regressions. And when they’re reported, people will generally try to hunt them down and fix them.

Gregory Kapfhammer 00:19:20 And normally those regressions are reported to you and your team by users, that is, people who are actually using the Linux kernel on their devices. Is that the right way to think about it?

Dave Airlie 00:19:30 Yes. Often there will be engineers who are embedded in those situations. So, for example, if the Steam Deck has a regression, it will come from a Valve engineer through AMD and it will be reported by that method. It won’t necessarily come to us, but often you will just get an email on the list from someone randomly testing an RC on their laptop and saying, hey, this RC doesn’t work. Another layer is distributions, which will often take the RC kernels and package them, and people will be running the bleeding-edge distribution and then they will see it, and the distribution will say, oh, this kernel seems to have broken a bunch of workloads or a bunch of laptops. And sometimes it’ll just be me sitting here booting it on my laptop going, hey, my laptop stopped working. Those are rare. And often Linus will have a specific graphics card from 10 years ago, and that graphics card broke, and Linus is the first person to find it because he just happens to be the one person in that line that has that card. So yeah, they come from a lot of places.

Gregory Kapfhammer 00:20:43 So in a minute we’re going to talk about many of the tools you use and a few more of the development and maintenance processes that you follow. But before we go into that phase of our conversation, I wanted to read a quote from your blog that I thought was really thought provoking. So, you wrote, “A warning then to anyone wishing for more vendor code sharing between operating systems. It generally doesn’t end with Linux being better off.” That’s really thought provoking. Can you share a little bit more about what you meant by that, Dave?

Dave Airlie 00:21:13 What people often don’t understand is that the incentives for companies to do things cause results that are probably not what you expected. So, when you push a company by saying we want Linux support for this device, in your mind that produces a Linux driver for this device that is a well-written, upstream-maintained piece of code that merges well with the kernel. But when you pass that into a company that is writing Windows drivers, their first instinct is, how can we share the code between our Windows driver and our Linux driver to cut the cost of doing this? In their mind that’s cutting the cost; in the real world it probably doesn’t, but companies work like companies do. So often what will happen is they will try to put a hardware abstraction layer in, or they will try to port their Windows driver to Linux. They won’t start a rewrite. So, the results you will get will either be a very second-class driver, or a driver that can’t go upstream, or a driver that needs another impedance layer between the actual vendor and the upstream. So, you don’t always get the results you envision when you say, I want more Linux code, or I want to use more Windows alignment. So, it’s often good to make sure you don’t just let the company run free with that idea, because the results you’ll get six months or a year later won’t be what you wanted.

Gregory Kapfhammer 00:22:24 So it sounds like what you’re saying is that in some way the Linux kernel itself might become more fragmented or even harder to support when you start bringing in this vendored code. Is that the right way to think about it?

Dave Airlie 00:22:36 Yeah, and it just becomes harder for us to extract commonality from the code. One of the big advantages of the Linux kernel, I suppose, is that because everything is in the same tree, we do extract commonality quite often. The commonality we extract is from our group of drivers. But when you are working in a world where you’ve got your commonalities between your Windows and your Linux drivers, you have a different vector for extracting that commonality. You want to extract the Linux stuff away and keep the common stuff between your Linux and Windows code. But in the kernel, we want to extract the common stuff out of all the drivers to be in Linux and not in the drivers, and that mismatch between your Linux and Windows code often gets very difficult for companies to resolve. So yeah, we tend to say we want a Linux driver because we want to be able to have it optimal in Linux, and if we see commonalities we can extract them. I think the best example I could give you was wireless. Years ago there was a big thing with the 802.11 layer: every driver was bringing its own 802.11 layer to the party, and someone said, well, why isn’t the operating system providing an 802.11 layer? And they did, but still for years we got wireless drivers which were written with the old 802.11 approach in mind. So, we see different vectors for commonality than Windows and Linux vendors do.

Gregory Kapfhammer 00:23:47 Okay, thanks. That was a good example. Now again, in a moment we’re going to talk about patching and backporting and things of that nature, but you also have just hinted at some of the counterintuitive aspects of maintaining the Linux kernel. And I know that many of our listeners may know mostly about open-source software through GitHub or GitLab or other systems of that nature. Before we go into the next phase of our show, can you say anything about how Linux kernel maintenance might not work well with the existing model that you would have on GitHub?

Dave Airlie 00:24:19 I think it just comes down to what I said earlier: scale. I think the kernel process is just so vast and has so many people that scaling it up on a single tree would be close to impossible. So, the Git forge model, where there’s a single central tree that you send pull requests to, just doesn’t scale well for us. And that’s, I think, what it comes down to. There is also another side to it. We have a strong resistance against using proprietary tooling at all in the Linux development process. That isn’t a top-down, Linus-driven thing. Linus is very flexible on this, but a lot of the maintainers have strong personal feelings on these things. So, we tend to shy away from using anything proprietary in the development process. So, GitHub for example, although it uses Git, is still quite a proprietary system. In the DRM subsystem, we started to use GitLab more extensively.

Dave Airlie 00:25:04 We use GitLab for hosting a lot of our trees and stuff like that. So yeah, I think the scale is generally the biggest problem, and then the tooling; we’re seeing problems with our tooling as well. Email and stuff doesn’t work like it did 20 years ago. Gmail holds all, and if Gmail doesn’t want to talk to you, your email server’s no use. So, there are a lot of problems we have to solve ourselves. Just because the Git forges don’t solve them doesn’t mean we don’t have to solve them, and we’re working on that with a bunch of tooling at the moment. So, it’s a different aspect to it.

Gregory Kapfhammer 00:25:30 Okay, that makes sense. Now before you mentioned the idea of a release candidate and we talked about RC1 through RC6. I have the understanding that there’s also something called a stable kernel. What’s a stable kernel and how does that connect to an RC?

Dave Airlie 00:25:45 So the stable kernel was something Greg Kroah-Hartman decided to work on, I can’t remember, maybe 10 years ago. He’ll correct me if I’m wrong. But the idea was that Linus’s releases are all very well and good, but people want to stay on an older release for longer and still get security fixes and regression fixes and possibly hardware fixes and improvements. Maybe not major functionality improvements, but quite often they want a good baseline that they could build things on. All this comes from the Android world and devices where you don’t want to be upgrading the kernel for fear of being sideswiped by a regression in some other area. So, you just want to keep going on the same kernel but build out on it. So, the idea of the stable kernel was shipped. No one said it would be a great idea, but Greg pushed through and shipped it.

Dave Airlie 00:26:27 It comes out, I’m not sure what Greg’s release schedules are. I think he does one every week or two as well. And there’s maintenance of maybe four or five different stable kernels over the range of the last maybe 10 or 20. Every so often Greg picks a kernel as the LTS, which is the long-term stable release, and that kernel will get maintained by Greg or someone else. Often someone else gets handed that job. One of the initial things that was announced about the stable kernel was that maintainers should not be required to add to their process to enable the stable kernel. The stable kernel should try not to force maintainers to support it, so that we have an option. I can’t say that was good or bad. In the graphics subsystem, we are both good and bad at helping the stable kernel.

Dave Airlie 00:27:09 Some parts of us are really good and some parts are not. We don’t have a great system. We don’t do bespoke work for stable in our group. Generally, we let the stable maintainers take care of it as much as possible, and occasionally, if we see a patch that we know should go into stable, we will tag it for that. And people in the hierarchy know to tag certain things for stable if it’s a regression from two kernels ago. And often we will automatically add Fixes: lines and tag stable on things we think should go in, and then let them decide whether it’s something that should go into it or not. The rules around stable patches are pretty strict; they’re not that flexible. Patches usually should be less than a hundred lines, and they should be quite self-contained. So, we don’t like to get series into the stable kernel.
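
To make the stable criteria mentioned above concrete, here is a minimal Python sketch of a pre-submission check. It is an illustration only, not a tool the kernel community ships: the function names `count_changed_lines` and `check_stable_candidate` are hypothetical, and counting only added and removed lines against the hundred-line limit is an assumption about how that limit is measured.

```python
import re

STABLE_CC = "stable@vger.kernel.org"
MAX_LINES = 100  # "less than a hundred lines", as described in the conversation


def count_changed_lines(patch_text: str) -> int:
    """Count added/removed lines inside diff hunks (assumption: context is ignored)."""
    changed = 0
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("@@"):
            in_hunk = True           # a hunk header starts the diff body
        elif line.startswith("diff --git") or line == "-- ":
            in_hunk = False          # next file header or the mail signature
        elif in_hunk and line.startswith(("+", "-")):
            changed += 1
    return changed


def check_stable_candidate(patch_text: str) -> list[str]:
    """Return a list of concerns; an empty list means none were found."""
    concerns = []
    # A Fixes: trailer names the commit that introduced the problem.
    if not re.search(r'^Fixes: [0-9a-f]{12,40} \(".+"\)$', patch_text, re.MULTILINE):
        concerns.append("no Fixes: trailer")
    # A Cc: trailer asks the stable maintainers to consider the patch.
    if not re.search(r"^Cc: .*" + re.escape(STABLE_CC), patch_text, re.MULTILINE):
        concerns.append("no Cc: stable trailer")
    changed = count_changed_lines(patch_text)
    if changed >= MAX_LINES:
        concerns.append(f"{changed} changed lines (limit is under {MAX_LINES})")
    return concerns
```

A check like this only flags candidates; as the conversation notes, the stable maintainers still decide what actually goes in.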

Gregory Kapfhammer 00:27:47 So you just mentioned the next word that I wanted to talk a little bit more about. You talked about the word patches. Can you say precisely what a patch is and how you manage a patch when you’re doing Linux kernel maintenance?

Dave Airlie 00:27:58 Yeah, that’s a good question. Years ago, I don’t think I would have had to answer this, but in the merge request world it’s a different way of working. In the kernel, we treat a patch as the smallest unit of work to make a change while being very self-contained. We put a lot of effort into our patch summary, the text in the patch, because the commit message is actually very important for the kernel. A lot of people in merge request workflows put a lot of work into their merge requests. We put that work as much as possible into the commit messages for each patch. Each patch should be very self-contained. It should do a single thing. If your commit message starts having “and” or “but”, then you’re like, okay, maybe there should be two patches. So, you get the instincts for what a patch should be over years of writing patches.

Dave Airlie 00:28:42 You should not drop a whole new driver in one patch. You should try to drop the infrastructure changes needed for your driver, then the core of the driver, then features for the driver, and they should all be well explained. I suppose the other way to think of a patch is that it’s the unit for review. This is something that should be digestible by a reviewer, and they should be able to decide on a patch-by-patch basis whether it’s something that we should accept into the kernel, something we shouldn’t accept into the kernel, or whether there are some problems and we need to iterate again. And a patch series is a gathering of those patches with a cover letter that describes the overall idea of where you’re trying to go with the series. And then each patch will take you on the journey to the result that you want.

Gregory Kapfhammer 00:29:20 Okay, that makes sense. Can you tell us, are there some tools that you specifically use when it comes to managing a patch or a series of patches and then how do you use those tools?

Dave Airlie 00:29:29 Like everything, there’s a lot of tools, and they’re all different ones, and every kernel maintainer probably has their own set. I think the common core is Git. Nearly everyone has agreed on that; years ago we had other systems. With Git as the common core, you generally will develop all of your changes in a Git tree on your local machine. When you’re finished with the development, you’ll have to go back and probably make a patch series out of that development that is linear and sane, because often development is not linear and sane. So, you have to go back and re-distill your patch series into something linear and sane, then go through the process of making sure it’s submittable, making sure you’re following the rules and that it’s actually the patch series you want to send out. And then you go through the process of sending it to the mailing list.

Dave Airlie 00:30:07 You usually do that through email. We have a new tool called b4. We have a set of tools for sending patch series out. There are also other patch management systems. So, Git isn’t always the best thing to manage patches with. So, there’s a number of things: there’s a series of tools called Quilt, there’s a thing called StGit. So, depending on how you want to manage your patches, there are other alternate tools. I personally, when I’m writing patch series, and I don’t write a lot of patch series for the kernel these days, will just use Git trees, and I will keep the cover letter off to the side, but that’s only because I’m probably only ever working on one major patch series at a time anymore. I’m not doing a lot of streams of work at that level.

Gregory Kapfhammer 00:30:46 Okay. So, you have patches, and you have patch series and I’m glad you defined each of those terms. I also quickly wanted to talk about the concept of a backport. What is a backport and how does it connect to the topics we’ve discussed so far?

Dave Airlie 00:30:59 So there are probably many aspects to what constitutes a backport. You can consider a patch going into stable as being a backport. In the simplest terms, a backport is something that you have to maybe adapt to an older kernel as opposed to just apply to an older kernel. So, if you have a patch that goes into the latest kernel and you want to apply it to a kernel that’s five kernels old, sometimes that patch will just apply and it’s fine. And that’s technically a backport, but not really; you haven’t done anything. But if you actually have to modify that patch to adapt it to the older kernel API, or just the older state of the driver back then, that’s getting more to what an actual backport is. Say you have a security fix for something in the new kernel, and the same security problem exists in the older kernel, but it’s in a very different state.

Dave Airlie 00:31:43 Like, the code that causes it has been changed substantially, but there’s still an underlying security problem there. Well, then you need to adapt that fix for that kernel, and sometimes it’s a full rewrite, sometimes it’s just a concept you’re applying back, but you still need to do it. You then have to scale that up again. I’ll talk about how Red Hat works. So, Red Hat has a RHEL kernel, and we backport subsystems. So, we will take the full DRM subsystem, which is like thousands of lines of code and patches, and backport that whole thing en masse to a kernel that was maybe eight kernels old. That’s extreme backporting. And you see Ubuntu do the same. You’ll see a lot of stable kernels for devices; Android does a bunch of it. So backporting is kind of, I won’t say the dirty secret, but if you’re actually releasing a product using the Linux kernel, you would generally be in the market for having to backport things to that kernel and need your engineers to do that.

Gregory Kapfhammer 00:32:36 How does backporting connect to the idea of doing a revert? What’s a revert then?

Dave Airlie 00:32:41 So, a revert is simply: I found a regression, and we need to get that patch out of the tree, but Git doesn’t allow you to change history. So, you just revert the patch. You take the patch that’s in the tree and you apply the reverse patch, and that’s it gone from that tree. Then you have to make sure that the revert is also treated just like any other patch: it goes into the stable cycle, which makes sure it gets removed from the older stable trees. Backports should pick up the revert. So generally, reverts will have a tag line for what they’re reverting, so it’s exact: this patch reverts this patch.
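
The tag line described above can be sketched in a few lines of Python. The message shape below follows what `git revert` writes by default (a `Revert "..."` subject and a `This reverts commit <id>.` line); the helper names `revert_message` and `reverted_commit` are hypothetical, and the idea that backport tooling would parse the line this way is an assumption made for illustration.

```python
import re
from typing import Optional


def revert_message(subject: str, commit_id: str, reason: str) -> str:
    """Build a commit message in the shape `git revert` generates by default."""
    return (
        f'Revert "{subject}"\n'
        "\n"
        f"This reverts commit {commit_id}.\n"
        "\n"
        f"{reason}\n"
    )


def reverted_commit(message: str) -> Optional[str]:
    """Return the commit id a message says it reverts, or None if it is not a revert."""
    match = re.search(r"^This reverts commit ([0-9a-f]{7,40})\b", message, re.MULTILINE)
    return match.group(1) if match else None
```

Because the reverted commit is named exactly, anyone carrying the original patch in an older tree can search for its id and know the revert applies to them too.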

Gregory Kapfhammer 00:33:11 Okay, that was awesome. So now we’ve covered a lot of the key details and terms and if listeners want to know more, they’re welcome to check in the show notes and we’ll provide links to kernel.org articles and handbooks and things of that nature. What I wanted to pick up on next was something that you mentioned just a moment ago, which were issues about security concerns. Can you tell us a little bit about how you do security reviews or maybe talk a little bit about how you do performance or stability reviews in the Linux kernel maintenance process?

Dave Airlie 00:33:40 I’ll deal with security first because it’s kind of a special case in some ways. So, we have a dedicated kernel security team. I’m not on that team; Greg is probably one of the leaders on that. I know Linus is involved; there’s an embargo process, but it’s very quick. I think seven days is all Linus likes to give people. Linus is very vocal about security fixes just being fixes. He believes there’s a lot of theater in security. So, he’s very pushy on just getting fixes into his tree as quickly as possible. It’s a fix for a bug; we need to get it upstream and send it out. But Greg has been building up a CVE sort of handling system for the kernel. We’re our own CVE handler, but we have a lot of potential security fixes in the kernel. And over the last year you may have seen some news articles and discussion about that. We can never really say for sure with a lot of things whether they’re a security problem, because a security problem in the kernel may not be a security problem in your kernel, and you need to be in charge of that.

Dave Airlie 00:34:37 So if you’re downstream of the kernel, packaging it up in your operating system, you probably need to be more on top of this, because the upstream kernel is used in so many different ways that every specific CVE that we issue for the kernel may not apply to your use case. But the subtlety in the security industry isn’t quite there yet, in that there’s a lot of CVE scanners that just take blanket approaches. So the security area is definitely a tricky one. In terms of process, generally there’s a security mailing list; if it’s in your area, you’ll get cc’d on the report. You are expected to bring in the other people in your area who are responsible. So, I can get the top-level cc, and if it’s not something I could directly fix or know about, I’ll have to bring in the people involved, and then we’ll keep it secret for seven days if we can. If we can get a patch out, we’ll get a patch out.

Dave Airlie 00:35:23 But yeah, lately I think AI-generated reports have started saturating the bandwidth as well. Some of them are good, some of them are bad, but primarily it’s a bandwidth saturation problem. The bad ones will still cost you time. In terms of dealing with other types of regression, like performance issues and things like that, quite often they just sort of come up over time. In the last few years we’ve moved into CI a lot more. But one of the biggest challenges with the Linux development process is that it doesn’t really integrate well with how CI works. I also work on some other projects where we have full GitLab CI processes, where every merge request goes through a full CI run before being merged. I like that; it makes life so much easier, but the Linux kernel is not designed for that because of our workflow.

Dave Airlie 00:36:14 There’s no central point to pick these patches up and push them through a centralized CI system, and everyone has built bespoke CI. And again, a central CI wouldn’t scale. One of the problems with CI is that it needs resourcing, and if we had a central one, the scaling is going to kill it. So having them per subsystem and having them more bespoke is probably a good thing. We have a graphics sort of CI pipeline; it’s not quite there yet. It’s been worked on over the last year. Some of the Qualcomm people have been pushing on it quite a bit. The idea is that I would love to see a pre-merge CI system, so that I would know that when a patch set gets to me, it’s gone through some sort of CI, but quite often that’s not true today. So, Intel have done a sort of CI integration with the mailing list, which is janky, but it works, in that they get patches on the mailing list, they apply them to a tree, they push them through the CI, and they report that to a tool called Patchwork.

Dave Airlie 00:37:04 I know AMD have some internal CI that they will push their next kernels through as well, and the coverage in that area is kind of very patchy. But often that’s how we’ll initially find regressions and things like that, through that. But then Valve will tell us if they updated the new Steam Deck and it’s suddenly slower. Things like that will come through. Also, a lot of regressions in performance and stuff can look like a graphics problem when it’s actually a power management problem. All the pieces of your SoCs and your GPUs are so tightly integrated now, and there are so many pieces to them. You could suddenly have a clock change, and yeah, that can cause some problems. So, you also have to know the architecture of the whole system to track those down.

Gregory Kapfhammer 00:37:45 Thanks, that’s helpful. I wanted to pick up on two phrases that you used. You mentioned CI, and I know there’s something called kernelci.org, and before that you mentioned b4. Can you tell our listeners very specifically, what are the things you run in CI and then how do tools like b4 help you? Do you run them in CI or do you run them on your developer workstations? What’s the process there, Dave?

Dave Airlie 00:38:08 Yeah, there’s no good central standard here at all. So, b4 is a tool that replaces email for getting patches from you to the mailing lists. The kernel tools group, Konstantin is the main guy there, have developed a thing called public-inbox, which is a mailing list archiver or hoster that we use as kind of a repository, a database of all of the patches ever sent to any mailing list. Everything is in this public-inbox hierarchy. And b4 is a tool that lets you submit things to those inboxes. In reverse, there’s a tool called lei, which is for pulling things out and lets you query those inboxes. So, I use it, for example, to query for all the pull requests in the last week, get them into a local mailbox, and then I process them locally.

Dave Airlie 00:38:54 So there’s tooling that avoids using SMTP and an email server and lets you go beyond that. But in terms of projects like KernelCI, and we have a GitLab CI project on freedesktop, there’s not much coherence there. I suppose the best way to put it is that different subsystems use them in various ways. It’s generally after-the-fact CI as well. Generally, stuff goes through CI after it’s been merged into Linus’s tree or after it’s been put in next. Very rarely do we have CI in advance of the patches going into someone’s tree, and that’s something I’d like to change. I encourage people in my area to try and develop things to help me change that. I probably am not the person to do that, but I encourage and am open to workflows that allow that to happen, because I see the value in it.

Dave Airlie 00:39:41 But yeah, KernelCI has the name, but in terms of whether all of the kernel goes through it, that doesn’t actually happen. And again, we also had a number of robots over the years. Intel had one called 0-day, which was something that would apply a patch and test it, but you never knew when 0-day was going to respond to you. 0-day could tell you about your patch a week later. It could tell you an hour later, it could tell you a month later. A lot of these, again, are after the fact: you send something to the mailing list and maybe you get a response. We’re trying to bring the window of time in on those and trying to constrain those. Again, it’s hard because there are so many kernel patches and scaling is hard. So, the Intel CI has lasted quite a while, in that they get patches on the mailing list, the CI applies them, the CI runs, and it works, but it’s very much strung together with baling twine and stuff. It’s not a very coherent system.

Gregory Kapfhammer 00:40:30 Thanks for those clarifying comments about how you’re using tools and how you’re using CI. I did want to follow up with one final question in this area. Do you use various types of fuzzing techniques when it comes to the DRM subsystem or is fuzzing used in any other parts of reviewing patches in the Linux kernel?

Dave Airlie 00:40:48 Fuzzing seems to go through popularity phases, I suppose. It becomes popular for a while, they find all the fuzzing bugs, and then it stops. We don’t have anything specific in our subsystem for it, but there’s an upstream project called syzbot. Syzbot is notorious for finding the weirdest race conditions in the weirdest places and then providing you with very little information on how it did that. It literally intelligently spams all the kernel APIs in various orders with various random things. So, it’s not fuzzing in the sense of sending complete garbage to things. It knows the structure of the API, and it knows the structure of the system call interfaces. So, it knows it needs to pass in valid file descriptors to get past the first checks, because otherwise 99% of your fuzzing stops at the first block. So, it’s intelligent enough to try and get down into the kernel depths. It’s quite a cool project. But yeah, we don’t specifically use that, but we respond to reports from it. But again, you can get overwhelmed with auto-generated reports of syzkaller problems, because some of them are so deep and so niche that no user is ever going to hit them. They can often get pushed to the background. They can be that one-in-a-million problem, and that’s good to know, but sometimes we have to over-prioritize them and that’s often a bad thing.
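
The point about structure-aware fuzzing getting past the first checks can be shown with a toy Python sketch. This is not how syzbot or syzkaller are implemented; `toy_handler`, the file descriptor set, and the command numbers are all hypothetical stand-ins for a kernel entry point that validates its arguments in stages.

```python
import random

VALID_FDS = {3, 4, 5}             # hypothetical open file descriptors
KNOWN_CMDS = {0x10, 0x11, 0x12}   # hypothetical command numbers


def toy_handler(fd: int, cmd: int, arg: int) -> int:
    """Return how deep a call got: 0 = bad fd, 1 = bad cmd, 2 = command body."""
    if fd not in VALID_FDS:
        return 0                  # most random input is rejected here
    if cmd not in KNOWN_CMDS:
        return 1
    return 2                      # only here would deeper bugs be reachable


def naive_inputs(rng: random.Random):
    """Complete garbage: every field is drawn uniformly at random."""
    return rng.randrange(2**31), rng.randrange(2**32), rng.randrange(2**64)


def aware_inputs(rng: random.Random):
    """Structure-aware: valid fd and command, random payload only."""
    return (rng.choice(sorted(VALID_FDS)),
            rng.choice(sorted(KNOWN_CMDS)),
            rng.randrange(2**64))


def deep_hits(make_inputs, trials: int, seed: int = 0) -> int:
    """Count how many generated calls reach the command body."""
    rng = random.Random(seed)
    return sum(toy_handler(*make_inputs(rng)) == 2 for _ in range(trials))
```

Comparing `deep_hits(naive_inputs, 1000)` with `deep_hits(aware_inputs, 1000)` shows the effect: every structure-aware call reaches the command body by construction, while almost all purely random calls stop at the file descriptor check.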

Gregory Kapfhammer 00:41:55 Okay, that was really helpful.

Gregory Kapfhammer 00:42:26 So I wanted to talk briefly about how people get involved with the Linux kernel maintenance process. I know you have this idea of new people submitting a patch or making a suggestion for a change. How does it work if one of our listeners finds what they think is a bug or has an idea for a performance improvement, can you give us a few insights on what that process might look like?

Dave Airlie 00:42:48 The world has changed over the years. I think the best way to get involved now is to get a job where they tell you to get involved. But secondary to that, personal itches are generally the best way to start on something. If it’s a problem you are seeing on your laptop or in your system, then yeah, you have an incentive to want to go down that road and dig into it. And the code’s all there. I encourage people to have a go at looking into it. Use your ChatGPT or whatever, use Gemini. Use those to interrogate what you think about the code or to try to figure it out. Work through it. If you find the area of the kernel the problem is in, well, then find the community attached to that area. There is no single Linux kernel community. There’s the Linux kernel mailing list.

Dave Airlie 00:43:27 That is not the place you start. The Linux kernel mailing list is where you drop patches when you aren’t sure what to do with them, and very often they will get ignored. It’s not a central place for development. You have to find your community. So, if you want to work on graphics, you go: okay, well, this is the DRM subsystem. Where are those developers? Are they on the DRM GitLab? Can I find them on IRC, maybe? Can I find them on Discord? Is there a community of people I can actually talk to and say, is this a good idea? You could send a mail to the maintainer. You might get a reply; it’s maintainer dependent, because some of us just get so many emails that we don’t get to them all.

Dave Airlie 00:44:02 I try to get to a lot of the new people asking questions by diverting them to the person who could better answer that question in the area, but not all; it’s a steady thing. But yeah, find the community, find out how they work, lurk for a while. Maybe watch their workflows a bit, maybe read a few patches, maybe get an idea of how the patches in that subsystem are reviewed. Maybe review a few patches if you feel that you know enough C and you’ve learned enough about the area. You can always do a drive-by patch. That’s where you just drop the patch, hope someone applies it, and never look at it again. And for small things, that’s probably fine. But if you are actually interested in getting into kernel development, often the best way to start is to find your personal itch that you want to scratch and then work with the community to figure out how to do that.

Dave Airlie 00:44:45 I do often get asked, can you give me a project to work on or something to start me off? And that’s a tough question, because without knowing your level of ability and skill, without having mentored you on something, I can’t really give you that. We’ve tried to construct to-do lists and they work in some ways. We’ve had some success with keeping to-do lists upstream, as long as we maintain them, and they give sort of levels of easy versus hard. But quite often you find that if an experienced developer hits a roadblock that is just a simple to-do list item, they’re going to go straight through it. It’ll be a patch 10 minutes later. So, you have to have things that core developers don’t want to fix right now, because otherwise they would just fix them right now because they’re important. So it’s hard to find that balance of tasks. And we often have new projects that are up and coming, like our Nova project, which is a GPU driver that has started. People want to get involved, but it’s at a stage where we can’t get people involved because we haven’t opened up the window where it’s parallelizable right now.

Dave Airlie 00:45:40 It’s very serial. We need to get certain things done and we need to get to a baseline, and we haven’t got there yet. So, it’s hard to get people to join until that baseline is achieved, unless they know some of the deeper parts of it or are working on it full-time.

Gregory Kapfhammer 00:45:52 Thanks, that’s really helpful. Now a while ago you mentioned the idea of a patch series as having like a cover letter. So, what goes along with a patch series beyond the cover letter, how do you show like what its tags are, how it was tested, how it was verified, what fuzzing it went through. Can you give us a little bit more of a sense of what a patch series looks like for the Linux kernel?

Dave Airlie 00:46:15 Honestly, for a patch cover letter, none of that generally goes into them, unfortunately. I think we would like that to happen more. But often the cover letter just has a list of the versions. So, say you sent a patch series out. Once it’s been reviewed, it’s probably hit a few fuzzers. It may have gone through a CI system. The second time you send it out, you may include some of that feedback in the cover letter. You may say changes since the previous version are because I got this feedback from a reviewer, I got this feedback from an AI reviewer, I got this feedback from the CI system, syzkaller found a problem. Something like that is often in there, but it’s not essential. We don’t have to put in that it went through a CI.

Dave Airlie 00:46:53 As I said, often the CI is after the fact. So, we review the patch more on aesthetics and functionality, and then when it’s merged we figure out whether it’s actually a major regression we didn’t spot. And to be honest, we don’t spot them all. One of the advances we’re seeing with AI reviewers is that they find some of these things that we humans just don’t. We might spot them one day and the next day we might not. The consistency isn’t always there. So yeah, ideally it would have all the information that’s pertinent to the person applying the series, to make it easier for them to apply your series. But in practice it varies between two lines and essays.

Gregory Kapfhammer 00:47:30 Okay, that’s helpful. I remember before you mentioned the idea of long-term support or LTS and we’ve already talked about things like backporting, but we haven’t yet defined the word downstream kernel. What does downstream mean?

Dave Airlie 00:47:43 So from the kernel sort of view, downstream is any kernel that isn’t Linus’s, I suppose, and maybe I can pull it out to being any kernel that isn’t stable. So, Linus’s and Greg’s trees; downstream is anything that’s not theirs and isn’t feeding into them. So, I wouldn’t consider the DRM tree as being downstream, because we are going to feed our tree into Linus’s. But if you are a consumer of Linus’s tree, like Red Hat is through RHEL and Fedora, or Android, ChromeOS, Ubuntu, then any of those sorts of kernels that are not just taking Linus’s kernel and packaging it up are what we would consider downstream trees. Any trees that have diverged from Linus’s core and added backports, their own drivers, their own things that haven’t gone upstream. So yeah, a bespoke kernel that is for some use case.

Gregory Kapfhammer 00:48:30 Okay, that’s really helpful. Now one of the other things that I remember you talking about before is what I might call a cross-subsystem change. So, like for example you mentioned how power management could influence things that are related to DRM and as a longtime Linux user myself, I have had that happen. So, can you talk a little bit about cross subsystem changes in the Linux kernel and how do you manage that? It seems incredibly complicated.

Dave Airlie 00:48:55 It’s one of the more difficult aspects, I suppose, in some ways to handle, because you often get a bug report in through your handling system and you are then going, okay, well, now is this my problem? Then you go looking for it and you dig through the kernel, and you decide it’s not your problem. But how do you then tell the person who submitted that bug report that they need to go annoy these other people? Some subsystems will go and bring that other subsystem in and try to talk to them. We have good relations between the power management people and the GPU people because we know we need to have those problems worked out. But if somebody comes in and says my network card suddenly broke my graphics card, then it’s harder. That’s another level up of how we figure that out.

Dave Airlie 00:49:31 Like, those sorts of problems. Quite a lot of that, though, is often mitigated through the core kernel, in that you try to use standard kernel APIs, and you don’t try to make cross dependencies between your areas without having them go through some sort of standard core kernel piece. And often the core kernel piece is then maintained by people on both sides of that divide. So, we have things like the dma-buf subsystem, which is a way of sharing memory between devices. Graphics people will use that, but RDMA people will also use that. We will talk through the dma-buf layer, and when we have issues or bugs, we’ll resolve them around that. The community around dma-buf will be a slightly different community than the community around core graphics or the mode setting team. So, yeah, I’d say power management has often got a little community around it of people who are both GPU people and power management people. So again, it’s mostly a social aspect of trying to get the people to actually talk to each other and trust each other, so that when they say they have a problem they believe each other.

Gregory Kapfhammer 00:50:24 So can you tell a story or give a concrete example of like a, some type of cross subsystem change and then how you as maintainers actually decided quote unquote whose fault it was and who actually had to make the fix?

Dave Airlie 00:50:37 Quite often we don’t hit that many at that level. Quite often, big cross-subsystem changes are well planned out. We will talk about them for maybe a year, maybe six months out. Say we want to make a big change to a kernel API. It’s either something that’s very automatic or I don’t ever see it. It happens completely in someone else’s tree. It goes to Linus, and all I get is the fallout when I have to fix the conflicts and stuff. But that’s just part of the job. At a more core level, a lot of the time, producing core functionality in the kernel that is cross-subsystem is where it gets tricky: getting things like dma-buf into the kernel and then making it usable for everyone. Those discussions are harder because you often have very opposing views on how these things should work, and some people will take very strident positions and not want to move.

Dave Airlie 00:51:23 And doing negotiation on those topics is actually quite a lot of my job. It’s impedance matching: you have engineers who are very focused on how they think something should work and engineers who think it should work another way, and it’s like, yeah, you’re both right, but that doesn’t solve the problem. I need you to come together and work out how it’s going to work between you, and you get personality issues there. And again, part of my job, and I think part of why I like doing a lot of the maintenance work I do, is dealing with the people and just solving those problems between them. That social trust aspect is very important. And again, one thing that’s come up, and Simona talks about this quite often, is that when developers build things inside companies and then upstream them, we often don’t get that commonality.

Dave Airlie 00:52:07 But it’s often not even about the commonality of the code as much as about having the community of people who understand the problem talking to each other. So, we’ve had this problem with, I suppose, one of our graphics drivers that went a bit rogue a few years ago and did a whole load of development work internally in the company and then decided they were going to upstream that work. And when they started upstreaming that work, we started seeing major structural problems with some of the other work they’d been doing leading up to it. The driver had diverged maybe by a year or two. We’re not closely monitoring; the trust system is there for a good reason. But the problem that came out of that is that they literally didn’t understand the problems that other people were solving.

Dave Airlie 00:52:44 And again, we often have a case where someone will go, okay, well, I’ve got this problem. Oh, we tried that five years ago and we couldn’t do it. It’s like, well, where did you try it? Oh, we did it internally. And it’s like, well, if you’d done that in public, we’d have that evidence that you tried it, so that a person wouldn’t waste their time. So it’s building up that sort of community infrastructure, especially as new technologies come along, like memory management things that have changed on GPUs, and building up that knowledge base and that core of people that understand the problem. The problem in the industry is that in each company there’s like one or two of these people, but you want all of the industry experts in the same place to solve that for Linux. And getting those five or 10 people to talk and trust each other is the tricky part.

Gregory Kapfhammer 00:53:20 Yeah, thank you for saying that. You’ve used the word trust several times and you’ve talked about how this is as much a technical issue as it is a human issue as well. And that makes me think of perhaps like a bigger picture question. As you’re going through an RC I’m guessing you must have some rough idea of, hey this RC is really healthy, it’s going well, or oh this RC is really struggling, we’re going to have to go to five or six or beyond. Can you take a big step back and give us a sense of how you’re gauging like the overall health of a release cycle?

Dave Airlie 00:53:52 Linus is really good at this. I’m not as good. This is one thing you’ll see in all of his weekly RC emails. He will always give us a sort of health warning of, oh, there are a lot more patches in this than I expect. He does a lot of trend watching. So, he’ll watch the trends in the last five to 10 kernels, and he’ll have an innate knowledge of, this feels big, this feels like there are too many patches, some of these fixes are big fixes. So he’ll often look at the tree and say, oh, there are way too many patches here. But then he’ll look at a sampling of those patches and say, oh, they’re all quite small, they’re all quite self-contained, they’re each fixing one small bug; it just happens that we’re fixing a lot of small bugs. But if he does that and he sees large changes, like modifying multiple things in the subsystem that aren’t reverts (generally reverts are fine), something that’s doing a pretty intricate change to fix a bug, he will get a bit antsy and he’ll be like, okay, no, I’m not too comfortable going from RC7 straight to final.

Dave Airlie 00:54:42 I think we should do an RC8. Quite often security issues can cause things like that, like Spectre and Meltdown, the big ones. Those cause that problem. For me personally, yeah, I keep an eye on sizing as well. Because I generate my weekly pull request to him, I’ll keep an eye on sizing generally. I also keep an eye on each tree. So, every week I generally get maybe four standard pull requests. I get our miscellaneous tree, I get the AMD folks, and two trees from different Intel teams, and those are my baseline. If those don’t come in by Friday, I’m like, something’s gone wrong. But then there’ll always be maybe one other. So maybe MSM will come in with the Qualcomm fixes. They don’t do weekly, so every two or three weeks I’ll get a big bunch of those, and sometimes those ones are like, okay, there are a lot of changes in here, could we have broken this up, or is there something else going on?

Dave Airlie 00:55:25 And that’s sort of how I gauge it. But for me, yeah, generally I find by RC4 or RC5 things do taper off. We do generally see things drop down to the whole pull request being less than a hundred or 200 lines. If I start seeing pull requests with five or 600 lines changed after RC5 or RC6, I will dig in a bit and look into it. But yeah, it’s kind of an instinct built over time; it’s just your comfort levels. We usually find the really big problems, and then it comes down to this: with graphics, if you break something central it’s obvious, but laptops are so diverse. There are so many different ways laptops and panels can break.

Dave Airlie 00:56:02 It’s not even one laptop. Every model of your laptop has a different panel. So those things are always going to be on a bit more of a long tail, but they’re not showstoppers. I generally treat a regression where somebody has a black screen as kind of a showstopper, because having a black screen means I can’t install Linux. I try to get those fixed as quickly as possible, either by reverting or by pushing people. But yeah, it is quite hard to describe a showstopper when you’re in the area of, yeah, it mostly works. And that applies to the kernel as a whole: the kernel even at RC1 mostly works pretty well. It’s very rare even at RC1 that it’s a complete, what we call, dumpster fire, but the potential is still there. The process is built quite well now, so that’s a lot rarer than it used to be. It used to be that installing RC1 on your laptop was like, well, will I have a file system tomorrow?
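
Editor’s note: the sizing instinct described above can be sketched as a small check. This is only an illustration of the heuristic as stated in the conversation, not a tool the DRM maintainers use. The tree names other than drm-misc, the `PullRequest` type, and the exact thresholds are hypothetical and picked to mirror the numbers mentioned (four baseline trees each week, a closer look at 500 or more changed lines from RC5 on).

```python
from dataclasses import dataclass

# Hypothetical names and thresholds, chosen only to mirror the conversation.
BASELINE_TREES = {"drm-misc", "amd", "intel-a", "intel-b"}
LATE_RC = 5               # "after RC5-6" in the transcript
LATE_RC_LINE_LIMIT = 500  # "five or 600 lines" in the transcript


@dataclass(frozen=True)
class PullRequest:
    tree: str
    lines_changed: int


def weekly_warnings(rc: int, pulls: list[PullRequest]) -> list[str]:
    """Return human-readable warnings for one week's pull requests."""
    warnings = []

    # Baseline trees that did not send anything this week.
    received = {p.tree for p in pulls}
    for tree in sorted(BASELINE_TREES - received):
        warnings.append(f"missing baseline pull request: {tree}")

    # Late in the cycle, large pull requests deserve a closer look.
    if rc >= LATE_RC:
        for p in pulls:
            if p.lines_changed >= LATE_RC_LINE_LIMIT:
                warnings.append(
                    f"{p.tree}: {p.lines_changed} lines changed at rc{rc}, "
                    "take a closer look"
                )
    return warnings


if __name__ == "__main__":
    week = [
        PullRequest("drm-misc", 120),
        PullRequest("amd", 640),
        PullRequest("intel-a", 80),
    ]
    for line in weekly_warnings(6, week):
        print(line)
```

A check like this only captures the countable part of the judgment. Whether a large late change is a plain revert or an intricate fix still needs a human look.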

Gregory Kapfhammer 00:56:46 Okay, that’s really helpful and I think it gives us a good intuitive sense of what we mean by the health of an RC. In a moment I want to talk a little bit about how AI is changing your landscape, and I also want to talk briefly about Rust and the Linux kernel, but before we move to that phase of our show and then draw the episode to a conclusion, is there anything that we haven’t covered that you thought we should cover?

Dave Airlie 00:57:09 No, I think we’ve gotten sort of the basic level topics.

Gregory Kapfhammer 00:57:12 Okay. And if listeners are curious and want to know more, I’ll pass along some links in the show notes, both to other episodes of Software Engineering Radio and to many of the resources that Dave and I have gathered. Now quickly, Dave, I wanted to talk about using AI because you’ve mentioned that several times. You talked about how you could have a coding agent do a review of part of the Linux kernel for you and then maybe point out a potential regression or a security concern. Can you tell our listeners a little bit more about how you’re integrating AI into your maintenance workflow?

Dave Airlie 00:57:44 Well, over the last year or so, this has become a topic of both interest and contention, I suppose, in the community. But the way I personally got involved was that at the last Maintainer Summit, Linus said, oh, I’m actually seeing good reviews coming out of some of this. He had access to some people at Google and some people at Meta who had been doing some of this work, and they’d been showing him some of the reviews, and he was sort of, I’m quite impressed. And one thing that happens every year at Maintainer Summit is we complain about not having enough reviewer bandwidth, not having enough reviewers. It’s a common refrain, and he was like, if this can help or find problems, that’s a good place. We should probably start looking into this.

Dave Airlie 00:58:22 It’s like, okay, it’s pretty good that he’s interested. I talked to a guy called Chris Mason at Meta, who has been leading their work on this. He actually wrote a framework for regression finding using AI. So, it was very specifically focused on prompting AI to find regressions. And how he tested it was by running it on older kernel patches and then finding fixes for those older kernel patches. If those fixes were for regressions, and if the AI identified the initial problem in the initial patch review, then obviously the system is useful, and he found that it was catching regressions. I can’t remember what the false positive rate was, but it’s in the 50% range. So that’s interesting, because if we can avoid having to put fixes patches in by having an automated reviewer give you feedback, then why wouldn’t we?
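
Editor’s note: the backtesting idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework discussed in the conversation. It assumes that later fix commits identify the commit they repair with the kernel’s `Fixes:` trailer, and that the reviewer’s output has already been reduced to a set of flagged commit hashes. The function names `fixed_commits` and `backtest` and the sample hashes are hypothetical.

```python
import re

# Matches the kernel's 'Fixes: <sha> ("subject")' trailer.
# Group 1 is the abbreviated commit hash.
FIXES_RE = re.compile(r"^Fixes:\s+([0-9a-f]{12,40})\b", re.MULTILINE)


def fixed_commits(later_messages: list[str]) -> set[str]:
    """Collect 12-character hashes of commits that later patches claim to fix."""
    found = set()
    for message in later_messages:
        for sha in FIXES_RE.findall(message):
            found.add(sha[:12])
    return found


def backtest(flagged: set[str], regressed: set[str]) -> dict[str, float]:
    """Compare commits the reviewer flagged against commits later fixed."""
    caught = flagged & regressed
    return {
        # Share of known-bad commits the reviewer flagged at review time.
        "recall": len(caught) / len(regressed) if regressed else 0.0,
        # Share of the reviewer's flags that matched a later fix.
        "precision": len(caught) / len(flagged) if flagged else 0.0,
    }


if __name__ == "__main__":
    history = [
        'drm/foo: fix leak\n\nFixes: 0123456789ab ("drm/foo: add bar")\n',
        "drm/foo: tidy comments\n",
    ]
    regressed = fixed_commits(history)
    flagged = {"0123456789ab", "ffffffffffff"}  # hypothetical reviewer output
    print(backtest(flagged, regressed))
```

Note that a flag with no later fix is not necessarily wrong, since the underlying bug may simply not have been found yet, so precision measured this way is a lower bound.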

Dave Airlie 00:59:05 Even if there’s noise in that system, it’s still better. 50 percent is a pretty good signal-to-noise ratio, and even if it’s not that, finding those regressions is still a lot of work. It’s expensive. When a regression actually gets into the Linux kernel and gets out into the world, that costs actual money in the world, even if the process isn’t seeing that. Having an AI find those, yeah, there should be value in that, and we need to work on it. What I personally did, since we also have a lot of AI stuff going on within Red Hat, was take that mandate I have from Red Hat and Linus and put them together. I used Claude to write a Claude reviewer, where it just passes the patch series to Claude Opus 4.6 and then gives a good patch-by-patch review.

Dave Airlie 00:59:47 I didn’t concentrate on regressions as much; I just wanted more of an overall review of the series, and a patch-by-patch review gives a basic set of problems. Now I haven’t enforced this, I haven’t pushed this onto the community. I built it beside the public-inbox infrastructure that was there on freedesktop, so you can use the same tools to pull the patch reviews as you use to pull patches, just with an alternative server. All the threading should still work. So, it’s there as an option right now. I’m also kind of using it as a reference. I’m running it manually at the moment and generating it every day or two, but if I see regressions and can go back to this reference and say, look, it spotted that problem and now we have this problem months later, I’ll get that feeling that maybe we should be pushing on this more.

Dave Airlie 01:00:27 But even in the last week or two, Google unveiled a patch review system. I’m going to say it wrong: Rashacticore, or something, I think it’s called. So, I need to catch up on that and maybe get them to start reviewing the DRM patches using that system if they’re interested, because mine is a bit janky, but it was doing the job for what I wanted to figure out. But I think people are definitely seeing value in AI patch review at the moment. Code generation, I still think there’s potential. I’ve seen a few patches generated with the help of Claude and they haven’t been horrible. So again, yeah, I’m quite open to seeing where that goes. I definitely think there will be potential for doing it, and it will pick up, I would guess, in the next year. I think Opus and Gemini are starting to get to the level where you can actually do some decent stuff with them. So I think it will develop heavily in the next year.

Gregory Kapfhammer 01:01:11 I know that we could do an entire episode just about using AI to do code review in the Linux kernel and I hope you’ll continue to write blog posts on this topic because I’m sure our listeners will be excited to learn more. Before we end the show, I did want to talk about one other thing that I remember reading on your blog, which is the integration of the Rust programming language in the Linux kernel. Once more I know that we could do a full episode on this topic, but can you give our listeners a little sense of how Rust is playing a role in the Linux kernel?

Dave Airlie 01:01:40 Yeah, actually my reasoning for getting behind the Rust effort is probably not what people would expect, and I think you could probably do a really good episode with some people I know about Rust and the Linux kernel from a technical standpoint. My push for getting behind the Rust effort was actually more social and community focused. I was at, I think it was Linux Plumbers Conference, maybe two or three years ago. I can’t remember the exact timelines. So Maintainer Summit, the Linux 30-people summit, is co-located with Plumbers Conference, and we have our private meeting and the conference. One of the big complaints in the private meeting over the years is that we’re all getting older. We’ve got a lot more gray beards; people are in their fifties now who were in their twenties when they started this journey.

Dave Airlie 01:02:20 So the community is definitely aging. And simultaneously, while people were complaining about that, at the conference I met some people from the graphics subsystem, just chatting, talking to people in the hallway track, and one person came up to me and said, oh, I really like working in the graphics subsystem. There’s a young group of people working here, they’re all very interesting, excited, I like going to conferences and meeting them. And I didn’t really know this existed, because I don’t work down at that level, but she said there’s this good community that she likes being a part of. And I was like, okay, I didn’t create that, but we’ve facilitated that in the DRM subsystem by being open to different things and not being so strict in our methods of developing the kernel. But then as well, the next time, I was talking to a few of the Rust people and I was like, these are very young people.

Dave Airlie 01:03:01 These are, again, a group of people who are maybe in their twenties, some in their thirties, but they’re a younger cohort of developers than the people I’m normally used to dealing with. And I was like, look, I think there’s a good way we can bring these together. I think having young people coming into the kernel using Rust as a mechanism is valuable, and so is having that different perspective and a different way of doing it. And so, I was like, well, if I’m getting feedback that there are young people in the DRM subsystem and it’s an interesting area to develop that in, then I think I should be supportive of putting Rust in the kernel. I see it as going to happen, I’ll help accelerate it, I’ll facilitate it. And I stood up at the Rust mini-conference and said, yeah, I’m open to doing this, let’s do it.

Dave Airlie 01:03:41 Now, in terms of me doing anything about it when I say that, I didn’t do anything about it. I just said I was happy to have people do something about it. And that’s been, I suppose, my working model over the years: just say you’re open to the idea, you don’t have to do it. It’s like, I’m willing to make space for this to happen and facilitate it. And what I actually ended up doing was going back inside Red Hat, finding one of my engineers who was working on stuff in VO at the time and who was a really good C developer, and suggesting it to them. They don’t work for me; they’re on one of my teams, but I’m not a manager. I was like, would you be interested in Rust?

Dave Airlie 01:04:15 And they went through the cycle of, I don’t really like this, I don’t really like it, now I really love it, and now they’re one of the Rust core maintainers. So, it’s about encouraging and building that group. If you wanted to talk about the technical aspects of Rust in the kernel, talking to someone like Danilo or Miguel would be ideal because they really know the details of it. For me it is just about bringing new people into the community and expanding the Linux kernel community. And I see the safety aspects of it, and I actually see good value in them. Especially in the graphics subsystem, we have a very wide API to user space, and it’s often exploited. So, I would rather we don’t have that; I’d rather that’s a safe API as much as possible.

Dave Airlie 01:04:53 And same with the hardware: we keep those safety concerns with the hardware interfaces as much as possible. I see Rust accelerating the ability to deliver GPU drivers. I see it being much faster, in that we can build a lot more common infrastructure. We can avoid making a lot of the C mistakes that we’ve repeatedly made. Long-time C developers will say we could do that in C, and I’m like, yes, we can, but we’ve repeatedly not. Yes, review can catch this, but we’ve repeatedly failed. If you have the best C programmer in the world, on a bad day they’ll still make that mistake. Rust gives you a lot more of, you can’t do that. You can build into your type system the fact that the compiler will stop you; it won’t let you do stupid things. And especially with lifetimes: lifetimes are one of the hardest things to get right in the Linux kernel. We’ve made like 20 years of lifetime mistakes. That would describe a lot of the kernel development process. We have made every lifetime mistake over the years, from when we moved from single processor to SMP, but even moving to threaded interrupts and all these things. All these changes over the years find new ways that we’ve messed up our lifetimes. Hot-plugging devices, lifetimes messed up. They’ve all involved redesigning things, and getting that right and embedding that in your type system is quite valuable so that we stop making those mistakes.

Gregory Kapfhammer 01:06:08 Yeah, thanks for all of these insights. It’s really been a lot of fun to have this conversation as we’ve been focusing on Linux kernel maintenance and I recognize there’s many technical details that we didn’t really cover, but I’m now interested in you taking a big step back as we conclude our episode. So, you’ve been working on kernel maintenance for more than two decades. Can you tell us a little bit briefly what has stayed the same and what has changed? Give our listeners a little bit of the picture as we conclude the episode.

Dave Airlie 01:06:36 Yeah, it’s an interesting question. I think the essentials of it have stayed the same. The basics, you write a patch, you fix a problem, it goes through the tree: the core of that has remained the same. There’s a mailing list, there are people, but the scale of it has changed so much. And the level has gone from, yeah, you have a patch, to now you need to do a lot more with that patch to get it to a quality standard that’s acceptable for the kernel. I think the quality of the kernel has changed substantially in those 20 years. The quality of our process for producing it has solidified into an actual process, as opposed to what it was 10 or 20 years ago, where it was just random: there were no timelines, there was no process.

Dave Airlie 01:07:16 It was very shoot from the hip, I suppose. So now it’s very formalized. It’s very structured. I would say it can be less fun, but less fun is important in that, yeah, you can’t suddenly make wide-ranging memory management changes that destabilize half the world’s computers. That’s probably not a good thing to be doing anymore. So, it’s definitely matured a lot, but I still think the core aspect of hacking on hardware devices and stuff is still, at least for me, very appealing. Yesterday I was writing code for Nvidia GPU drivers, even though it’s just stuff to do, and I still like having a hardware device. I just got an Nvidia Spark box, and just getting that lit up with open-source Linux drivers is something that still appeals to me even 20 years later, that whole aspect of it.

Gregory Kapfhammer 01:08:01 So thank you very much for your enthusiasm. It’s really contagious. And if I could say on a personal note, as someone who’s been using Linux for decades myself, thank you for all of the work you and your many colleagues have done, because for me, using Linux has been a genuine source of joy in the same way that this has been a joy-filled and awesome conversation. Dave, is there anything that we haven’t covered that you think we should conclude on?

Dave Airlie 01:08:25 No, I think just, people, if you’re interested in the Linux kernel, just get out there and find an itch, scratch it, and submit some patches. The community’s out there and we are always happy to have new people join.

Gregory Kapfhammer 01:08:34 Okay, thanks for all these insights, Dave. This is Gregory Kapfhammer signing off for Software Engineering Radio.

[End of Audio]
