
SE Radio 634: Jim Bugwadia on Kubernetes Policy as Code

Jim Bugwadia, CEO of Nirmata and a committer to the Kyverno project, joins host Robert Blumen for a discussion of policy as code and the open source Kyverno project. The discussion covers the nature of policies; policies and security; policies and compliance with standards; security scans that generate reports compared to tools that allow or deny operations at run time; Kyverno as a Kubernetes service; the Kyverno Helm chart; the components of Kyverno; bootstrapping a Kubernetes cluster with Kyverno; installing policies; implementing policies; customizing policies; packaging and installing policies; Kubernetes dynamic admission controllers; the Kyverno admission controller; securing Kyverno itself; observability of Kyverno; and the types of reports and messages available to cluster users.

This episode is sponsored by QA Wolf.



Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Robert Blumen 00:00:19 For Software Engineering Radio, this is Robert Blumen. Today I have with me Jim Bugwadia. Jim is the co-founder and CEO of Nirmata. He’s an advocate for cloud native computing best practices. He’s a chair of two working groups of the Cloud Native Computing Foundation, Kubernetes Multi-Tenancy and Kubernetes policy. And he’s a committer on the open-source Kyverno project. He’s a frequent speaker at conferences such as Cloud Native Security Con. Jim, welcome to Software Engineering Radio.

Jim Bugwadia 00:00:54 Thanks for having me, Robert. Pleasure to be here.

Robert Blumen 00:00:57 We will be talking about policy as code and Kyverno today. Before we get started, is there anything else about your background that you’d like to share with listeners?

Jim Bugwadia 00:01:08 Sure. So I’m a software engineer, still actively, of course, contributing to multiple projects. I started my career in software engineering in the telecommunication space, so building distributed systems in a very different manner than what we see today. So I worked at companies like Motorola, Bell Labs, Lucent, and now as you mentioned, focus more on cloud-native systems.

Robert Blumen 00:01:33 Great. And that’s what we will be talking about today. I know from reading the documentation that Kyverno is a policy management tool for Kubernetes. We’re going to get all into that, but let’s start at a high level, talking about policies. When we are talking about these kinds of policies, what are we talking about, and how are these managed policies distinct from the number of other things in the Kubernetes space that are also called policy?

Jim Bugwadia 00:02:00 Right? Yeah. So policy is quite an abstract and vague term, right? But if you kind of think about it, in our real lives, in our day-to-day work, we have policies for things like expenses and vacations, which are just written somewhere. These are documents that we share, and we all want to abide by within an organization. Similarly, if you think about what’s happened in IT in the last, let’s say, 10 or so years, we’ve moved from system administration to DevOps to DevSecOps. So more and more collaboration across different teams and different groups is required. And what that brings in is, as you are sharing configuration, as you’re managing these increasingly complex and large systems, you need some form of digital policy, which everybody in the organization is going to look at and abide by. Some of these policies may exist because of regulatory compliance across the industry, like PCI and HIPAA in financial systems and healthcare, or they might be internal best practices. But in this form of policy, we’re really talking about a digital artifact which all the different collaborators can look at, can understand what it means, and know exactly how to apply within their domains.

Robert Blumen 00:03:27 It might help if we could get more specific. I noticed in the documentation site for Kyverno, there’s a section which lists perhaps several dozen categories of policies. What are some of the categories of policies that are managed by Kyverno?

Jim Bugwadia 00:03:44 Yeah, great question, right. So Kyverno started life in Kubernetes within the CNCF. And as you may know, within Kubernetes the unit of deployment and management of any workload is a pod. In Kubernetes, all configuration is also very declarative. You tell the system how you would like it to behave, and then various controllers go off and do their job and try to bring the current state of the system to the desired state. So starting with that context: for every workload, developers want to specify the configuration for their workload, and they would write several different things, and in Kubernetes, declarations are in YAML format. So they would write things like how many replicas their pod might have, what types of resources their pod has, and which container images the pod needs to run.

Jim Bugwadia 00:04:44 So all of that gets specified in a pod declaration. But then the pod declaration also has things like a security context, where for every container there are certain security rules or security configuration you want to attach. It may have things like a node selector. So within that same declaration, within that single YAML artifact, there are things the developer cares about, things the ops team cares about, and things the security team cares about. A very concrete example of a policy for security is, within that pod, making sure the security context abides by certain best-practice rules so there can be no container breakouts or privilege escalations for a workload. That’s something a security team can define as a policy in Kyverno and deploy across all their clusters. Kyverno operates as an admission controller, so anytime there’s a change request within a cluster, Kyverno can intercept that request, understand what that change means, and apply the set of policies required to either allow or deny that request.
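A minimal sketch of such a security-context rule, expressed as a Kyverno ClusterPolicy, might look like the following (the policy name and message are illustrative, patterned after Kyverno's published pod security policies):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privilege-escalation   # illustrative name
spec:
  validationFailureAction: Enforce       # block non-compliant requests at admission
  rules:
    - name: check-allow-privilege-escalation
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "allowPrivilegeEscalation must be set to false."
        pattern:
          spec:
            containers:
              # Every container's securityContext must explicitly disable escalation
              - securityContext:
                  allowPrivilegeEscalation: false
```

Note how the `pattern` mirrors the structure of the pod spec it constrains; that structural match is the core idea discussed later in the episode.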

Robert Blumen 00:06:00 So you just gave us one example of the workload permission. Could you give another example of a policy that I could download or view on the Kyverno website?

Jim Bugwadia 00:06:11 Absolutely. So one extremely simple and common example is that you want to make sure every workload has certain labels, right? And labels are used for best practices, for organizing data, for querying, things like that. So ensuring that your organizational labels are set, like the team ID or something that correlates who owns that workload or who’s requesting or running it. Because Kubernetes and cloud native environments tend to be shared, you have heterogeneous, multiple workloads running on common infrastructure, so things like labeling become important, and that’s a simple policy. Another example would be, every time a new namespace is created in Kubernetes, to automatically generate some secure defaults, like for networking, the firewall rules, what traffic is allowed in and out of that workload. Those sorts of things you could also generate by default.
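The label example above can be sketched as a short validate rule; this is a hypothetical policy (the `team` label name is an assumption) modeled on Kyverno's well-known "require labels" sample:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label   # illustrative name
spec:
  validationFailureAction: Audit   # report violations rather than block them
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label `team` is required on every Pod."
        pattern:
          metadata:
            labels:
              team: "?*"   # Kyverno wildcard: any non-empty value
```

Switching `validationFailureAction` from `Audit` to `Enforce` turns the same rule from a report-only scan into a hard admission-time deny, which maps directly to the two tool categories discussed next.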

Robert Blumen 00:07:10 Security-related tools, we could perhaps classify into two groups: those that do scans and give you a report of things you need to fix, and those that are active in real time and will block you from doing anything you must not do, while allowing you to do the things that you may do. Can you easily put Kyverno into one group or the other, or does it have elements of both?

Jim Bugwadia 00:07:34 It does do both. But the main value there is proactive enforcement. Because, like you mentioned, there are several scanning tools which can react to configuration that’s already in production, but by the time something’s in production, it’s too late. What you want to do is prevent invalid configurations from going to production. If you look at all the security headlines, the common finding is that about 80 to 90% of security issues are because of misconfigurations. And the real value proposition of a tool like Kyverno is preventing misconfigurations as early as possible in your software development lifecycle. We’ve all heard about shift-left in security. With Kyverno, we think of it as shift-down security, because we’re baking this into the platform itself.

Robert Blumen 00:08:26 We’re going to get a little bit later into some other things you’ve mentioned, like the controllers and how the policies are written. I want to stay for a minute at this high level. You mentioned that many organizations are driven to adopt policies in order to comply with different standards, like SOC. You have hundreds of policies pre-written on the Kyverno website. To what extent do you have a compliance-in-a-box type solution, where you could download 50 or 100 policies as a package that would get you some percentage of the way toward a given type of compliance?

Jim Bugwadia 00:09:07 For Kubernetes best practices or security-related configuration, Kyverno has a very solid and strong policy set out of the box you can just get started with. And that’s because the Kubernetes community also maintains something called pod security standards, which is a live document that evolves with every release, and Kyverno policies cover that. Now, if you move higher to standards like PCI DSS or HIPAA, those types of things, there’s vendor tooling, like from my company Nirmata, other companies like Red Hat, and also other cloud providers, that provides these compliance standards built on Kyverno policies or other policy engines as a complete solution. The challenge we saw with Kyverno and wanted to address, and we often face this during the audit process: every environment with Kubernetes, because there’s so much extensibility, might have different sets of tools. So proving compliance requires flexibility in policies. Maybe one environment uses Istio as a service mesh, another uses Linkerd, and each one may have a different set of best practices. That’s where having the ability to easily manage this policy lifecycle in a declarative manner, as policy as code, becomes extremely important.

Robert Blumen 00:10:40 When we’re talking about now the management of policies, one example would be allow and deny. I understand Kyverno can also modify requests before they’re applied to correct them. Can you give an example of when you would do that?

Jim Bugwadia 00:10:56 Absolutely, yeah. So one simple example is if you are deploying a workload and it doesn’t contain any resource requests. Now, anything that you want to run on your cluster will consume some CPU, some memory, and perhaps some other resources like GPUs, et cetera. So it makes sense to have some baseline of requests, because otherwise what happens is Kubernetes schedules the workload as best effort, which means that if some other workload comes in and requests resources, the best-effort workload may get de-scheduled or moved off certain nodes. To prevent that, it’s important that any application you expect to keep running, long-lived applications, have resource requests. For something like this, developers may not know what to set. So administrators can set a default CPU minimum as well as a default memory minimum. And with auto-tuning in Kubernetes, it’s possible to then adjust this based on heuristics and observability metrics that are collected over time.

Robert Blumen 00:12:07 In your example then the modification would be, if a request for workload does not have resource constraints attached, then Kyverno would apply a reasonable default to that request.

Jim Bugwadia 00:12:21 Absolutely. And it can tune that over time too, right? Which is quite interesting, because in Kubernetes environments, typically you’re collecting metrics; you have things like Prometheus as a metrics server. So Kyverno can integrate with the metrics server, check for resource consumption, and tune that, because the newer versions of Kubernetes now support vertical pod autoscalers, which allow in-place updates to some of these requests.
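The defaulting behavior described above can be sketched as a Kyverno mutate rule. This is a minimal, hypothetical example (the `100m`/`128Mi` defaults are assumptions an administrator would choose) using Kyverno's add-if-not-present `+()` anchor:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-requests   # illustrative name
spec:
  rules:
    - name: set-container-requests
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              - (name): "*"              # conditional anchor: apply to every container
                resources:
                  requests:
                    +(cpu): "100m"       # added only if the field is absent
                    +(memory): "128Mi"   # added only if the field is absent
```

Because the `+()` anchors only fire when the field is missing, developers who do specify requests are left untouched; only best-effort pods get the defaults.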

Robert Blumen 00:12:50 You did start out to tell us the history of the project. We got partway down that road. I wonder if, do you have an awareness of how standard is either Kyverno or policy management in general as one of the services that pretty much every cluster needs to run? Or where are we on that adoption curve for the concept of policy management?

Jim Bugwadia 00:13:15 CNCF runs surveys on some of this, especially on their top projects, to see and measure adoption. From the latest surveys, what we have seen is that about 40% of the respondents right now are using some form of policy management. Kyverno has about half of that share. The other half is with another tool called Open Policy Agent, which uses Rego as its policy language. So that’s another solution in the CNCF landscape for policy management. But to your question, and it’s a good point, there’s still work to be done in terms of awareness that policy is really a must-have for systems like Kubernetes. You need some form of policy enforcement, whether you’re using Kyverno or alternatives in the community.

Robert Blumen 00:14:08 If I’m adopting Kyverno, I’m of course going to look through what policies people have already written, but then I may find nobody’s written the policy that I want. I want to first ask, can these prebuilt policies be parameterized or can they indirectly import settings from your cluster so that you can to some extent customize them the way you want?

Jim Bugwadia 00:14:35 Yes. So with Kyverno policies, you can declare variables, and you can pull this variable data from external sources, whether it’s ConfigMaps in your cluster or other controllers; you can even cache these periodically in a global cache that Kyverno offers. So there’s a lot of flexibility in parameterizing and externalizing data which may vary over time. Like in the metrics example, right? If you’re checking with the metrics server, and that metrics server happens to be in-cluster, that’s fairly low latency; you can make some rapid calls to it and check. But if you are doing that check with something off-cluster, you might want to periodically pull down that data, cache it in your cluster, and then make a decision of whether to mutate, or whether to allow or deny, workloads, things like that.
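A sketch of how a policy can parameterize itself from cluster data: here a rule pulls a value from a ConfigMap via Kyverno's `context` mechanism. The ConfigMap name, key, and label are hypothetical; only the `context`/`configMap` syntax follows Kyverno's documented form:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: check-env-label   # illustrative name
spec:
  validationFailureAction: Audit
  rules:
    - name: env-label-matches-config
      match:
        any:
          - resources:
              kinds:
                - Pod
      context:
        - name: cfg
          configMap:
            name: policy-settings    # hypothetical ConfigMap holding the allowed value
            namespace: kyverno
      validate:
        message: "The `env` label must match the value configured in policy-settings."
        deny:
          conditions:
            any:
              - key: "{{ request.object.metadata.labels.env || '' }}"
                operator: NotEquals
                value: "{{ cfg.data.requiredEnv }}"
```

Changing the ConfigMap then changes policy behavior cluster-wide without editing or redeploying the policy itself.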

Robert Blumen 00:15:27 Can you think of a situation either you encountered or maybe a user where they looked through the prebuilt policies, they couldn’t find it, and they had to write their own policy?

Jim Bugwadia 00:15:39 Absolutely, right. So we do see that, and it’s one of the motivations for introducing Kyverno. Kyverno started about two years after Open Policy Agent. And what we noticed is, as much as the community understood the use cases for Open Policy Agent, adoption stayed fairly low because of the complexity of writing policies in Rego, it being a different language with a learning curve for Kubernetes admins. So when we started Kyverno, one of the guidelines for the project was: we want anybody who learns Kubernetes to be able to write Kyverno policies without any additional training or knowledge, and without any new language to learn. So starting out with Kyverno is extremely simple. Literally, you can go from zero to value in under five minutes. And then as you want to customize or write more complex policies, Kyverno does allow languages like JMESPath or CEL, which is a newer language that a lot of Kubernetes controllers, and Kubernetes itself, are starting to adopt. CEL stands for Common Expression Language.

Jim Bugwadia 00:16:50 So it’s another way of declaring small pieces of logic or code within things like YAML configurations. So yes, it’s very common for folks to customize or write policies. We also see a lot of questions on our community channels. Kyverno has a very active Slack channel in the Kubernetes workspace; in fact, we are ranked the second most active, right after Kubernetes itself, which is an interesting statistic. And we see a lot of questions asking for help with policies, things like that, as Kubernetes administrators customize these policies to their needs.

Robert Blumen 00:17:30 Now, looking at these policies, you’ve mentioned they’re written in YAML, but it looked to me like some of it was very declarative and some of it was a little bit imperative, in that it was importing looping-type concepts. Could you comment more on what is involved in implementing a policy? What types of languages or libraries do you need to master?

Jim Bugwadia 00:17:54 Yeah, so the first thing is of course understanding Kubernetes itself, right? Most policies, I would say the simpler policies, the bulk, maybe 50 or 60% of policies, are fairly straightforward. They mimic the structure of the resource that you’re trying to apply the policy to. For example, if you’re applying a policy to a pod: in every Kubernetes declaration, the de facto way of declaring it, there’s a spec element and a status element, and spec of course is short for specification. Within that, for a pod, you would have containers, and within a container you would have a security context. So that’s how the YAML is laid out. A policy to match something in a security context would follow almost exactly that same structure.

Jim Bugwadia 00:18:51 So it becomes extremely simple for somebody who understands what a pod declaration looks like to be able to write a Kyverno policy that matches that structure and enforces some constraints on certain fields within the pod. That’s a very easy, straightforward starting point. But then there are things like, you mentioned, in a Kubernetes pod you could have multiple containers, and containers are organized as either a containers declaration, which is your main application container, or init containers, and you can even have ephemeral containers, which is a newer feature. So now, if you want to really enforce some security constraint, you might need to loop across all container types, and all containers within each of those types, and enforce some policy. That’s where Kyverno has things like foreach as a declaration. There’s another language called JMESPath; it’s commonly used in CLIs to process JSON in an efficient, time-bound manner, and Kyverno supports that language. Common Expression Language, or CEL, is also something Kyverno 1.10 onwards has added support for, and CEL is used in Kubernetes in a few different places. So as you get to more complicated policies, you will end up using either JMESPath or CEL, or in some cases both, depending on what you want to accomplish.
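The loop-across-container-types pattern described above can be sketched with Kyverno's `foreach` and a JMESPath multiselect that flattens all three container lists. The rule below is a hedged example (the `privileged` check is one plausible constraint), following the list expression shown in Kyverno's documentation:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: check-all-containers   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged-anywhere
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        foreach:
          # JMESPath multiselect: flattens containers, initContainers,
          # and ephemeralContainers into one list to iterate over
          - list: "request.object.spec.[containers, initContainers, ephemeralContainers][]"
            deny:
              conditions:
                any:
                  - key: "{{ element.securityContext.privileged || `false` }}"
                    operator: Equals
                    value: true
```

Each iteration binds the current container to `element`, so one rule covers every container type without duplicating the check three times.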

Robert Blumen 00:20:28 If I want to constrain values, like something must be greater than zero, I can see that’s completely declarative. But I can imagine situations where I need to write a service in a high-level language, and the rule I’m trying to express is: call this service and it will tell you whether you can do the thing or not. So I’ve essentially factored out a portion of my policy into another program that may be imperative. Is it possible to integrate that type of logic into a policy?

Jim Bugwadia 00:21:02 Yes. So Kyverno supports API calls to internal Kubernetes services, with bidirectional security and other checks. You can call any other Kubernetes controller, or you can even call an external API. The only caution is that if you’re calling external APIs, especially if your policy applies during admission control, you need to make sure it executes extremely efficiently and there’s low latency in those calls, because you’re blocking other API calls while that’s occurring.
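A sketch of the in-cluster API call variant: Kyverno's `apiCall` context entry can query the Kubernetes API server and feed the result into a rule. The owner-label requirement here is hypothetical; the `apiCall`/`urlPath`/`jmesPath` fields follow Kyverno's documented syntax:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: check-namespace-owner   # illustrative name
spec:
  validationFailureAction: Audit
  rules:
    - name: require-owned-namespace
      match:
        any:
          - resources:
              kinds:
                - Pod
      context:
        - name: nslabels
          apiCall:
            # Look up the labels on the namespace the pod is being created in
            urlPath: "/api/v1/namespaces/{{ request.namespace }}"
            jmesPath: "metadata.labels"
      validate:
        message: "Pods may only run in namespaces labeled with an owner."
        deny:
          conditions:
            any:
              - key: "{{ nslabels.owner || '' }}"
                operator: Equals
                value: ""
```

Because this call stays inside the cluster, it keeps the latency profile Jim describes as acceptable for admission-time checks; an off-cluster call would be a candidate for periodic caching instead.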

Robert Blumen 00:21:40 I noticed on the Kyverno documentation page, and we discussed this a little while ago, there are categories, and within each category there are many policies. Does Kyverno have any concept like package management, where I can say I want all the CNCF node policies as a bundle, and it will go and grab at a larger granularity?

Jim Bugwadia 00:22:04 There is a way to organize them. Kyverno itself doesn’t do this, but there are higher-level tools in the Kubernetes ecosystem, and of course other tools that build on Kyverno. Very commonly you’ll see the term policy sets, which, like you’re envisioning, is a bundle: a group of related policies that you want to deploy and operate together. One common packaging for anything in Kubernetes is Helm charts, right? Kyverno policies, because they’re Kubernetes resources, can be easily organized into a Helm chart. You can deploy that as a versioned unit. With tools like Flux and Argo CD, you can even put that Helm chart into an OCI registry and pull it down into your cluster. So the beauty of Kyverno is that, because the approach is that policies are just Kubernetes resources, you use the tooling you would normally use for other Kubernetes resources to manage policy as code and that lifecycle as well. You don’t need any custom tools, which other engines or other solutions require you to use.

Robert Blumen 00:23:15 Got it. So Kubernetes already has a package manager, which is Helm. You don’t need to provide a new package manager for Kyverno, because you use the one that everybody’s already using. Okay, great. This last response you gave does start to get into another thing I want to cover, which is: how do you get Kyverno bootstrapped into your cluster? Clearly, I would like as much as possible of all the things I’m running to be compliant with policies, but you have to get a certain amount of stuff set up before you could even install Kyverno. So can you take us through where in the cluster standup Kyverno fits?

Jim Bugwadia 00:23:56 Yeah, so Kubernetes has a concept of a control plane and then a data plane, which is the worker nodes attached to the control plane, right? The control plane runs things like etcd, the API server, and other Kubernetes controllers like the scheduler, et cetera. So of course when you’re provisioning a cluster, the control plane components come up first, and those typically run, if you’re running an HA configuration, with a minimum recommended of three nodes for Raft consensus across availability zones, also for etcd. So typically you bring up your API server first. The other thing that Kubernetes clusters require is a CNI, the container networking interface in Kubernetes; worker nodes do not go into a running or available state until you have a CNI installed, right? So you would usually install a project like Cilium or Calico as your CNI, and then Kyverno tends to be the next thing you want to get installed before anything else is allowed, right?

Jim Bugwadia 00:25:04 So the order would be control plane components, then the CNI for networking, because if you don’t run your CNI, the worker nodes aren’t available, and Kyverno installs as a deployment on the worker nodes. So you do need to make sure that’s up and running first, then Kyverno, and then all of the other controllers you want to bring in, because policies need to apply to controllers as well; Prometheus needs to be secured, or Istio needs to be secured. So you want to make sure that Kyverno comes right after the CNI and before everything else: all the other base controllers, and then of course the workloads, which app teams would then deploy subsequently on the cluster.

Robert Blumen 00:25:47 I want to refer our listeners to Episode 590 on standing up a cluster and Episode 619 on Kubernetes networking, where we cover the CNI. So now, back to Kyverno. You said it installs as a deployment. Is there one or more Helm charts for Kyverno?

Jim Bugwadia 00:26:07 It’s a single Helm chart, and within that Helm chart there are multiple controllers and custom resources. So it’s a fairly full-featured Helm chart which installs a number of things on the cluster. Kyverno itself runs as four different controllers. There’s an admission controller, which receives requests directly from the API server; a cleanup controller, which cleans up resources; a reporting controller, which is responsible for reporting; and a background controller, which can apply mutate and generate rules to existing workloads within your cluster. So those are the four controllers, four deployments, which you’ll see within the Kyverno namespace itself. But it’s a single Helm chart, which you can install using any standard tools or GitOps tools like Argo CD, Flux, and others.

Robert Blumen 00:27:05 You mentioned then it does have its own namespace. If I listed the objects in the namespace, and I’ll forgive you if you don’t have a hundred percent of this top of mind, what are some or most of the resources you would see in the namespace when it’s running?

Jim Bugwadia 00:27:23 Yeah, so in Kubernetes, namespaces are the security boundary and unit of isolation. The best practice is to use a separate namespace for each workload, so Kyverno installs in its own namespace. In there, you would see the four deployments that I mentioned, and of course, based on your HA configuration, you might see multiple pods for these. You will also see things like a certificate that Kyverno self-generates, which it uses to register with the API server; there will be a secret for that, and it creates some other cluster-wide resources internally. But all of this is fully automated, right? And a few other things you’ll see: a Kyverno ConfigMap, which is used for certain parameters to configure Kyverno, things like that, within that namespace.

Robert Blumen 00:28:14 Is Kyverno a stateful service?

Jim Bugwadia 00:28:17 No, it’s stateless. And the way it works, there are different high-availability modes based on which controller you’re focused on or looking at. The admission controller is completely stateless and scales out, which means you can grow the number of replicas to handle a higher load; you can of course scale each admission controller up as well. Other controllers, like the background controller or the report controller, will run leader elections for certain tasks, which means that only one of them will be elected the leader within their cluster of services and will be performing a task. But if that leader goes down, there’s an immediate re-election, which happens automatically, and the new instance elected as the leader will take over those tasks.

Robert Blumen 00:29:09 Can you say a bit more about why it would be important for a tool that is examining requests and accepting or denying them to have a leader?

Jim Bugwadia 00:29:20 So there are certain things, like, for example, I mentioned that Kyverno automatically generates a secret and a certificate to register securely with the API server, right? And it periodically checks whether that certificate needs to be regenerated, has expired, et cetera. Now, you don’t want all instances of Kyverno constantly checking that. So tasks like those are delegated to one leader instance. But it’s all stateless in the sense that it’s only stateful at that moment in time; if that leader goes down for even a few milliseconds, a new leader will be immediately elected and will take over that task.

Robert Blumen 00:30:02 And you’ve mentioned the admission controller a couple of times. I’m aware from the documentation that it is an instance of a Kubernetes concept called a dynamic admission controller, and that’s not specific to Kyverno. Could you review what that controller is in general for Kubernetes, and then we’ll come back to Kyverno?

Jim Bugwadia 00:30:23 Sure. So dynamic admission controllers are a way of extending Kubernetes. Kubernetes has a concept called custom resource definitions, which is extremely powerful: you can extend the API and have your own object declarations in OpenAPI v3 schema. Dynamic admission controllers follow that same theme of extensibility. What they allow you to do is this: all API requests go to the API server, and anytime an API request hits the API server, it’s first authenticated and authorized. After that phase of processing, there’s another phase called admission controls. Kubernetes has built-in admission controls, which are part of the API server. You can toggle these using flags, using arguments, when you configure the API server if you’re running your own Kubernetes; if you’re using a cloud provider or managed Kubernetes, you have to go through their configuration to toggle these.

Jim Bugwadia 00:31:28 But then, after the built-in admission controls are applied, Kubernetes applies dynamic admission controls, which is a call-out to any external service or deployment, which can also get an admission request from the API server and participate in either allowing or denying that request based on the payload and other configuration. So Kyverno, like you mentioned, is an example of a dynamic admission controller. It runs as its own workload outside of the API server and gets these requests. Dynamic admission controllers, much like anything in software, involve trade-offs, right? If they’re not configured correctly, or if they end up adding too much latency, there can be challenges in scaling and managing the cluster correctly. So they have to be extremely performant, very fast, typically milliseconds in terms of responding. Kyverno is highly tuned, highly optimized for that type of workload: it’ll cache everything in memory and make admission decisions very quickly. But it is possible to write policies in a manner like we were chatting about earlier where, if you end up making external API calls, you end up injecting latency, right? But going back to dynamic admission controllers: it’s an external service which the API server will call out to and delegate an admission decision to, to say, should I allow this API request to proceed or should I prevent it, along with some reason for why it was blocked.
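The registration Jim describes is done with a webhook configuration object. The following is a generic, hypothetical sketch of a `ValidatingWebhookConfiguration` (service name, path, and webhook name are illustrative; Kyverno generates and manages its own configurations automatically):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-validating-webhook   # illustrative; not Kyverno's actual object
webhooks:
  - name: validate.example.com       # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail              # deny requests if the webhook is unreachable
    timeoutSeconds: 10               # admission calls must respond quickly
    clientConfig:
      service:
        name: my-admission-service   # hypothetical in-cluster service
        namespace: default
        path: /validate
      caBundle: <base64-encoded-CA-bundle>   # placeholder; establishes TLS trust
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
```

The `caBundle` field is what ties into the certificate discussion that follows: the API server uses it to trust the webhook's TLS certificate, whether self-generated or brought along by the operator.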

Robert Blumen 00:33:09 The word in this case admission, it’s maybe a little bit quirky, but that means in effect, an API call to the Kubernetes API. Is that right?

Jim Bugwadia 00:33:19 That’s correct. And every change in Kubernetes, anytime you change any configuration, even if you generate an event in Kubernetes, goes through the same process: it goes through the API server, which delegates, and it goes through all of these phases. Even if you’re trying to exec into a pod or mount a file, all of that is subject to the same process.

Robert Blumen 00:33:41 And how are these dynamic admission controllers authorized?

Jim Bugwadia 00:33:45 Great question, right? So Kubernetes has something called token review, which is built into it, right? From a security perspective, you can use token review to know that a request is coming from a trusted source. And of course, when you’re configuring these admission controllers, you can also set up standard RBAC, and this is where putting them in a namespace which is secured is extremely important. What you want to avoid, and Kyverno by default avoids this, is policies being applied to the Kyverno namespace itself, right? And that obviously can be a security risk if the Kyverno namespace is not properly secured. So it becomes like a bootstrapping problem again, where you need that first root of trust; you need to make sure that every layer is properly secured. But then, as you’re getting API requests, Kyverno can check and see that the request came from the proper source. And when Kyverno registers, it registers itself using something called a webhook configuration: there’s a validating webhook configuration and a mutating webhook configuration. And for the secret that I mentioned Kyverno manages, you could bring your own certificates, but if you don’t, Kyverno will itself generate a certificate. That’s how the API server knows that Kyverno is trusted for admission requests as well.

Robert Blumen 00:35:12 So what level of authorization is needed to run the Helm chart that installs Kyverno?

Jim Bugwadia 00:35:19 You have to be an administrator, right? You can’t be just a normal user. Much like with, again, a CNI or other kinds of controllers, a cluster admin would need to install this. You need permissions to create custom resources within your cluster, and you need permissions to change things like webhook configurations, which significantly impact cluster behavior. So only admins can do this.

Robert Blumen 00:35:46 I’m building a cluster; I’ve booted it up and, just like you said, I install Kyverno as the next thing after the control plane and the CNI. At what point do you install the policies that Kyverno is enforcing?

Jim Bugwadia 00:36:03 That’s right: after you bring up Kyverno, the next thing you would want to do is roll out the policies. Usually if you’re using something like Argo CD or Flux, that would be the next workload. You first want to make sure Kyverno itself is up and ready, and these tools will check the status of these controllers to make sure they’re healthy. When Kyverno reports as healthy, you can start deploying policies. So you would do that as the next workload right after Kyverno.
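The ordering Jim describes can be expressed declaratively. As one sketch, in an Argo CD app-of-apps setup, sync waves sequence the policy set after Kyverno itself; the repository URL, paths, and wave numbers here are hypothetical:

```yaml
# Hypothetical Argo CD Application for a policy set, sequenced after a Kyverno
# Application at an earlier sync wave in an app-of-apps layout. The repo URL
# and paths are placeholders, not from the episode.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kyverno-policies
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # Kyverno itself would sync at wave "0"
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/cluster-config.git
    targetRevision: main
    path: policies
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # re-apply policies if they drift from Git
```

Because Argo CD waits for the earlier wave to report healthy before syncing the next, the policies only land once Kyverno's controllers are ready to enforce them.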

Robert Blumen 00:36:34 We’ve gone through these steps and added some more workloads that we want to run on Kubernetes, and later on down the road we want to upgrade just the policies, but not necessarily Kyverno itself. Could you talk about upgrading policies, and are policies themselves versioned so that it’s clear what version of any given policy I have running?

Jim Bugwadia 00:37:00 Yes. So you would want to version, and again, we think of this as policy as code. Much like with a software application or any other code you’re deploying, you want to manage your policies in Git or some other version-controlled system. You want to bundle them using package managers like Helm, and you want to deploy them either through GitOps or through OCI registries. So all of those best practices apply. And of course you want to unit test as well as end-to-end test those policies before they hit your production clusters, right? All of that is extremely important. But the basic unit of anything being "as code" is to build in that versioning. And typically, rather than versioning each individual policy, you would want to version them as a policy set, and package that policy set as a Helm chart or some Git repo, which a GitOps controller will then deploy.

Robert Blumen 00:38:03 Now, once you have Kyverno running, there is another type of failure mode or error that Kubernetes developers can encounter, which is that the thing they want to do has been denied because it violates a policy. What kind of feedback, error messages, or logs do they get? How does a developer become aware that they have been denied access because they violated a policy, which policy it was, and what exactly in the policy failed?

Jim Bugwadia 00:38:35 There are several options here, and depending on the type of cluster, the environment, and even the organization, you can decide which one to use. One is, of course, if the workload is blocked at admission controls, then there’s immediate feedback based on the deployment tool you’re using. Like, again, a GitOps controller, or if you’re just using kubectl, the Kubernetes CLI, you will see the error and the reason why it was blocked directly in the CLI. And all of this is customizable within the policy, right? As you’re authoring policies, you can customize that message. You can even link to your internal wiki page or knowledge base on remediation. In fact, solutions like Nirmata, which build on top of Kyverno, give customizable remediation help and guidance, all of that built in. So that’s one way: enforcing and blocking.
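A minimal sketch of such a policy, with a denial message pointing developers at remediation docs, might look like the following. The policy name, label requirement, and wiki URL are illustrative, not from the episode:

```yaml
# Illustrative Kyverno ClusterPolicy: blocks Deployments missing a 'team'
# label and surfaces a remediation link in the denial message the developer
# sees in kubectl or their GitOps tool.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # block at admission; Audit would only report
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: >-
          Deployments must carry a 'team' label.
          See https://wiki.example.com/k8s/labels for remediation guidance.
        pattern:
          metadata:
            labels:
              team: "?*"   # any non-empty value satisfies the rule
```

A developer whose `kubectl apply` is rejected sees the `message` text in the CLI error, which is the feedback loop Jim describes.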

Jim Bugwadia 00:39:36 Now, for workloads which are already deployed, because imagine you already have a production cluster, you’re adopting Kyverno, and now you’re rolling out policies, you want to give feedback to the existing workload owners as well. So Kyverno, beyond admission controls, will run routine background scans that apply the policies to every workload. And that data is collected in another resource in Kubernetes, which is a policy report. This is very useful for compliance as well, because you can tell which workloads passed and which failed, and it gives you accurate information about all the policies that were applied to a workload, the violations that were produced, and which workloads are compliant. So now a higher-level tool can collect that periodically across all your clusters, aggregate it, and show it in dashboards, or you can build your own dashboards.

Jim Bugwadia 00:40:34 Or if you’re in a smaller environment with just one or two clusters, you can use kubectl and the Kubernetes APIs for this. But one interesting thing about that policy report is that it’s not just limited to Kyverno, because what we did is we spun out that policy report, and as you mentioned, I co-chair the policy working group in Kubernetes. What we were looking at is: what can we standardize across different policy engines, scanners, and various tools for security, operations, and compliance? And one idea was, why not standardize on the reporting format? So anything that wants to report anything of interest in Kubernetes can use this policy report format. And Kyverno does the same. In fact, there’s a sub-project within Kyverno called Policy Reporter, which can take things from Kyverno as well as other scanners. It integrates with Trivy for vulnerability scanning, it integrates with Falco for runtime, and it’ll show you all of these reports in that standard format across all of these tools for your cluster.
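To give a feel for the standardized format from the policy working group, here is a sketch of a PolicyReport; the namespace, workload, and result values are made up for illustration:

```yaml
# Illustrative PolicyReport in the Kubernetes Policy WG format; the policy,
# workload, and counts are hypothetical. A report like this is queryable with
# plain kubectl, which is what makes the findings visible to everyone.
apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  name: polr-ns-team-a
  namespace: team-a
summary:
  pass: 11
  fail: 1
  warn: 0
  error: 0
  skip: 0
results:
  - policy: require-team-label
    rule: check-team-label
    result: fail
    message: "Deployments must carry a 'team' label."
    resources:
      - apiVersion: apps/v1
        kind: Deployment
        name: payments-api
        namespace: team-a
```

Because it is an ordinary Kubernetes resource, any tool, Kyverno, Trivy via Policy Reporter, or a custom dashboard, can produce or consume the same shape.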

Robert Blumen 00:41:42 If you are developing on Kubernetes and you have a good understanding of what some of the policies are, of course you’re not going to intentionally design a service that will violate policies. But can you think of an experience you had, or someone you’re aware of, where they tried to do something and it was blocked, and that wasn’t what they were expecting, and they learned something a little bit unexpected about the policies that were running?

Jim Bugwadia 00:42:10 Kubernetes is, of course, constantly evolving, right? And there are always interesting things happening within the space, within the ecosystem. A lot of this also depends on what other controllers you install within Kubernetes, whether it’s for service mesh, or if you’re running Argo CD in Kubernetes you might need policies for that. The interesting thing about the community is there are always new policies flowing in; there are always new findings. Like just recently there was something published by the security company Wiz, where they documented an exploit they found that used Istio to take advantage of a configuration setting in a Kubernetes pod which allows one container to share the network namespace of another container. What they were able to do is configure their user to match the Istio container’s user, and then they suddenly got visibility into everything that Istio can see.

Jim Bugwadia 00:43:19 So things like that, which again, this is a new finding, you can very easily craft a Kyverno policy for, and if you deploy it on your clusters, unless somebody is maliciously using this exploit, you would not expect anybody to be running as the Istio user within a regular container. Things like that would be in that category of new findings. The other thing is that Kubernetes, as popular as it is, has a very large surface area for a system, right? So not everybody knows everything. As a developer, look, I might understand how to build a Docker or Podman container image, but beyond that, I don’t know about all these settings. Like, why should I even care what a security context is, right? Unless somebody explains this to me. So as we see developers in their Kubernetes journey, there are constantly these types of learnings, to say, oh, okay, maybe I have this share process namespace setting, and I need to set it to false.

Jim Bugwadia 00:44:25 And somebody needs to explain why this needs to be false, or why it is not set that way by default. So with Kyverno, one other interesting thing you can do is have the security and ops teams set secure defaults. Then if the workload owner happens to set it to true, for whatever reason, their workload would be denied. But they can create another Kyverno resource called a policy exception. So they can say, I need that exception, and here’s why. And then the security team can sign off on it. And I mean literally sign off, using a digital signature, right? They can approve it, and then that workload is allowed. So you can automate that whole workflow in a manner which is conducive to DevOps best practices, doesn’t block developers, and keeps them informed every step of the way.
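The exception mechanism Jim describes is itself a Kubernetes resource. A sketch of a PolicyException, with hypothetical policy, rule, and workload names, might look like this:

```yaml
# Illustrative Kyverno PolicyException: grants named debug pods in one
# namespace an exception to a specific rule. All names here are hypothetical;
# in the workflow Jim describes, a resource like this would be reviewed and
# signed off by the security team before being applied.
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: debugger-share-pid-exception
  namespace: team-a
spec:
  exceptions:
    - policyName: disallow-share-process-namespace
      ruleNames:
        - check-share-process-namespace
  match:
    any:
      - resources:
          kinds:
            - Pod
          namespaces:
            - team-a
          names:
            - debug-tools-*   # only these pods are exempted
```

Because the exception lives in Git alongside the policies, it carries its own audit trail and can be approved through the same review process as any other code change.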

Robert Blumen 00:45:21 I’m glad you mentioned that because I was going to ask about exceptions, but I’ll consider that matter to be addressed. Now, this is not specifically a Kyverno question, but I’m aware of a common thing that happens where you run a security tool and you get a report back which contains thousands of violations. People feel totally deflated; they look at that and think, there’s no way, given our workload and the number of people we have, we’re ever going to address this. And so nothing gets done. So my question is: are you aware of groups you’ve seen who have deployed Kyverno, gotten this report, burned it down to zero, and then kept it green?

Jim Bugwadia 00:46:05 Yes. They are few, but they do exist, and it is possible, right? It takes work, it takes effort. And again, the power of Kyverno and how it’s structured in Kubernetes, along with some of the other tooling, the flexible reporting, the exceptions, is that a lot of the problem we see with those thousands of findings is that if the findings are only visible to a few people, like the security team in a security tool which is only accessible to them, it’s not going to help the rest of the organization, right? So you really want to democratize this and bring it into tools that developers can see as early as possible in their application lifecycle, and that the platform teams can see. So multiple roles can see it. And in many ways, the power of Kubernetes is its standardization as an API set, right?

Jim Bugwadia 00:47:06 So in Kubernetes is the first time in our industry, I believe that we have a common standard for describing workloads, operating workloads, and collecting information about workloads through this API standard. And it, it’s because it’s extensible and it’s brilliantly designed to be extensible at scale. And now we can do that with reporting so that the way to solve this and the way we’ve seen teams solve this is by applying the kind of adage of divide and conquer. You can’t have one team be responsible for all of this, right? Every security is a shared responsibility. You need to make sure that workload owners are aware of the best practices. And as a developer, if somebody is blocking my workload, I want to know why, right? So gimme the right information in my tool without me having to jump through hoops or without like reactive security would be somebody sees thousands of findings after something’s in production and now there’s no easy way to deal with this as an organization.

Robert Blumen 00:48:16 We have an upcoming episode, not yet published as of this one, on the process of production readiness. I could see that being policy compliant should be incorporated into an organization’s definition of production readiness. What’s your view on that?

Jim Bugwadia 00:48:36 That’s absolutely correct, right? And what’s very interesting, and you’ve probably seen this trend within the community, especially in the cloud native community, is this progression from DevOps to DevSecOps to now platform engineering, right? If you think about what platform engineering is all about, it’s treating the platform, and these platforms are typically built on Kubernetes, as an end product itself, and then offering what are known as golden paths to developers. So the idea is to codify what it takes to get to production readiness and make that very visible, or make folks aware, as early as possible. Kyverno policies not only apply as admission controls and as background scans in clusters; you can apply them in your CI pipeline, right? So you can scan Kubernetes manifests even before they’re deployed to any cluster, get the results, and make developers aware, to say, hey, here are the best practices we as an organization require, here’s the policy compliance we require. And you can show them the remediations. And of course, again, higher-level solutions like Nirmata do this across clusters, pipelines, and even cloud services. Because Kyverno started in Kubernetes, but it has expanded beyond Kubernetes and can now scan any JSON or any kind of workload regardless of where it’s running.
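The CI-pipeline step Jim mentions is typically done with the Kyverno CLI's `apply` command, which evaluates policies against manifests without any cluster. One sketch, as a GitHub Actions job with hypothetical repository paths and an illustrative install step:

```yaml
# Hypothetical CI job: run Kyverno policies against manifests in a pull
# request, before anything reaches a cluster. Repo layout ('./policies',
# './manifests') and the install action version are illustrative.
name: policy-check
on: [pull_request]
jobs:
  kyverno-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Kyverno CLI
        uses: kyverno/action-install-cli@main
      - name: Apply policies to manifests
        run: |
          # Evaluates every policy in ./policies against the manifest;
          # a failing rule makes the command (and the job) fail.
          kyverno apply ./policies --resource ./manifests/deployment.yaml
```

This shifts the same policy feedback left: the developer sees the violation in the pull request instead of at admission time.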

Robert Blumen 00:50:09 I now realize I wish I’d asked you this a while back when we were talking about bootstrapping, but here it is now. You can make up some numbers for the purpose of this example, but pick your cluster size: how many resources does Kyverno need for its services to run, for some size of cluster that you’ll describe?

Jim Bugwadia 00:50:32 Yeah, so clusters vary a lot across organizations, right? We have worked with some customers who have huge clusters with over 5,000 nodes, and others who have hundreds of clusters, but each cluster is like 10 to 20 nodes. What matters to Kyverno, though, is how much activity is in those clusters. Because if you think about it, once a resource is configured, it’s configured, it’s static. Yes, there’s some overhead for background scanning, but the pressure during admission controls is how many admission requests per second you are getting, right? So the way we measure Kyverno scalability is through that unit, ARPS, admission requests per second, and that’s how we have sized Kyverno. We are also in the process of putting in a horizontal pod autoscaler for the admission controller, and that’s a best practice to follow for production.

Jim Bugwadia 00:51:30 But it usually starts at around, I think, 50 to 200 meg, which is more than sufficient. So memory is not the constraint; it’s CPU bound, because processing large JSON payloads takes CPU, right? So Kyverno tends to be more CPU bound. Typically, if you’re running any production workload, we would say about a hundred meg in terms of memory, running three instances at a hundred meg each, and then having at least two CPUs or so allocated per instance, and then with some scaling, right? So you could start much lower, but allowing an upper bound around that is a good size; for a mid-size production workload it would be more than sufficient.
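As a rough translation of those numbers into Helm values, something along these lines could size the admission controller: three replicas, memory requests near 100Mi, and CPU headroom. The key names follow the general layout of the Kyverno chart but should be checked against the values.yaml of the chart version you actually install; treat this as a sketch, not the chart's documented interface:

```yaml
# Illustrative Helm values sketch for sizing Kyverno's admission controller
# along the lines described in the episode. Verify key names against your
# chart version before use.
admissionController:
  replicas: 3
  container:
    resources:
      requests:
        memory: 128Mi
        cpu: 100m
      limits:
        memory: 384Mi
        cpu: "2"        # allow bursting; Kyverno tends to be CPU bound
```

Starting low with a generous CPU limit matches the "start much lower, allow an upper bound" guidance, and a horizontal pod autoscaler can then scale replicas with admission load.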

Robert Blumen 00:52:16 I wanted to talk about the observability of Kyverno itself. Does it integrate with all of the standard tools, whatever you might be using for logging, metrics, traces, and anything else?

Jim Bugwadia 00:52:30 OpenTelemetry is the standard for cloud native workloads. So yes, Kyverno fully supports OpenTelemetry for metrics, for logging, for tracing, even for spans, right? So you can see exactly how much time is spent between the API server and Kyverno, and then between Kyverno and any other services you’re calling. One commonly called service is the OCI registry, which is used not just for images, but also for artifacts, like signatures, to say: is your image signed? Was it signed by the correct CI/CD workflow, like your GitHub workflow? Are there attestations, like a scan report or an SBOM, attached to your images? All of that you can check with policies, but those require calls to the OCI registry, which does introduce some potential latency in the overall admission process. But yes, OpenTelemetry is integrated into Kyverno.

Robert Blumen 00:53:29 When you deploy Kyverno with a Helm chart, does that come with any dashboards?

Jim Bugwadia 00:53:35 Not by itself, right? There’s a sub-project called Policy Reporter, which you can install separately, and that gives you some in-cluster dashboards. There is also a Grafana dashboard, which is another sub-project. So if you’re running tools like Grafana and Prometheus, which most cloud native deployments will, you can install that dashboard and get some Kyverno metrics. Kyverno itself reports the metrics and is enabled for it, but doesn’t come with dashboards in the basic Helm chart itself.

Robert Blumen 00:54:08 If you set out to build a dashboard, what are the one or two or three metrics that you really want to see if you’re going to look at one dashboard?

Jim Bugwadia 00:54:18 All of the basics of Kubernetes best-practice monitoring, right? Your pod health, your deployment health, the number of replicas, all of that is extremely essential, and that applies to any critical workload, including Kyverno. But in addition, I would measure the admission requests per second and the policy rule execution latencies, which Kyverno is instrumented to report. Because what you want to make sure is that no rule is taking more than, at the maximum, a few seconds. Ideally, it’s under about a hundred to 200 milliseconds in terms of execution time.

Robert Blumen 00:54:57 Great. Now, you mentioned earlier there is at least one other tool in this space, the Open Policy Agent, which uses a different language to configure the policies. Are there any other key points of comparison between Kyverno and Open Policy Agent?

Jim Bugwadia 00:55:14 Yeah, so there were different philosophies, different approaches. Myself, like I mentioned, I come from an operations background more than a security background, as do a lot of my team at Nirmata, and of course the community we grew and built the project with. Interestingly, Kyverno was first developed as a component in Nirmata; it wasn’t called Kyverno at that time. Then we spun it out as an open-source project. So as we built Kyverno, our focus was operations as well as security, right? SecOps rather than purely security. The approach we took is that Kyverno, from the very beginning, was designed not just to validate, enforce, and block invalid or insecure configurations, but also to mutate and generate configurations, which we believe is critical to doing proper end-to-end policy management.

Jim Bugwadia 00:56:15 So generating secure defaults in real time, in cluster, is essential for Kubernetes. Like the namespace example I gave earlier: anytime you create a new namespace, for whatever reason, you want to generate things like fine-grained roles, role bindings, network policies, quotas, and other artifacts. If you’re using Istio, maybe an Istio policy or some other CNI policy; all of that needs to be automatically generated. Or if you’re deploying a workload, you might want to generate a VPA recommender configuration to observe that workload and fine-tune its resources, right? So that was one of the key features in Kyverno, which is extremely unique to it. And then things like reporting through CRDs, custom resources which become part of the Kubernetes API, and exception management through the Kubernetes API, all of those are major differentiators in Kyverno.
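The generate capability Jim describes can be sketched as a ClusterPolicy that watches for new namespaces and creates a default-deny NetworkPolicy in each one; the policy name and the choice of a deny-all ingress rule are illustrative:

```yaml
# Illustrative Kyverno generate rule: on namespace creation, create a
# default-deny-ingress NetworkPolicy inside that namespace. Names and the
# specific generated resource are examples, not from the episode.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-networkpolicy
spec:
  rules:
    - name: default-deny-ingress
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-ingress
        namespace: "{{request.object.metadata.name}}"   # the new namespace
        synchronize: true   # re-create the resource if someone deletes it
        data:
          spec:
            podSelector: {}        # applies to all pods in the namespace
            policyTypes:
              - Ingress            # no ingress allowed until explicitly opened
```

The same pattern extends to the other artifacts mentioned, roles, role bindings, and quotas, each as an additional generate rule matching on Namespace.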

Robert Blumen 00:57:15 You mentioned a couple of times that Kyverno is an open-source project. What else are you doing at Nirmata, besides contributing quite a lot to the Kyverno project?

Jim Bugwadia 00:57:27 Yeah, so lots of interesting things, and open source, of course, is a lot of fun. It’s very exciting to work with the community, and there’s this sort of symbiotic relationship between open-source projects and the companies that back and sponsor them. For us, the approach we took is that we want Kyverno to be very full featured, very complete, and something that gives almost instant value to end users, right? That’s extremely important to us, and we don’t intend to cripple Kyverno in any manner just to offer commercial solutions which unlock critical things for production. That’s not the approach we took. Instead, the analogy that myself and my co-founders at Nirmata often use is that we think of what Nirmata is to Kyverno as what something like GitHub or GitLab is to Git.

Jim Bugwadia 00:58:25 All developers understand Git commands. It’s not very hard; it’s actually pretty easy for any organization to run its own Git server. You can run it as a Helm chart or as a pod in a very simple manner. But the value tools like GitLab or GitHub provide is allowing teams to collaborate on top of Git, providing things like audit trails and other information. So if you want teams to really leverage policy as code, we believe Nirmata becomes essential, much like GitHub becomes essential for a Git deployment. What Nirmata provides is collaboration and workflows: developers can see remediations, which are instrumented by your security teams; security teams can see reports; and the ops teams can, of course, manage policy deployments. So it becomes that hub for policy as code across your fleet of clusters, for reporting and collection.

Jim Bugwadia 00:59:29 Whereas in each cluster you can get these reports through the Kubernetes APIs, Nirmata does the deduplication, the aggregation, the enrichment, and the assignment to the right owners. There’s a lot of value there, even just from the reporting perspective. And then finally, if Kyverno is managing and enforcing your policies across your pipelines and clusters, how do you know Kyverno actually is running and somebody hasn’t misconfigured it, right? So Nirmata also manages that across your fleet, both pipelines, clusters, and other services, to make sure that policies haven’t been tampered with and the right versions of policies are deployed on each cluster. And in addition, you also get compliance standards. So going back to what we talked about: if you want PCI compliance or HIPAA compliance, or you have your own custom standard, Nirmata provides that across your fleet of clusters and workloads.

Robert Blumen 01:00:26 Jim, I think we’ve had a very good coverage of policy as code and Kyverno. If listeners would like to find or follow you, is there anywhere you’d like to direct them?

Jim Bugwadia 01:00:36 Sure. I’m pretty easy to find on most social media sites, LinkedIn as well as X (Twitter). Of course, if you’re in the CNCF communities, I hang out in some of the various working groups, as well as the Kyverno Slack channel in both the Kubernetes workspace and the CNCF workspace.

Robert Blumen 01:00:55 Jim, thank you for speaking to Software Engineering Radio.

Jim Bugwadia 01:00:59 Thanks for having me, Robert. My pleasure.

Robert Blumen 01:01:01 This is Robert Blumen, and thank you for listening.

[End of Audio]
