
SE Radio 635: Stevie Caldwell on Zero-Trust Architecture

Stevie Caldwell, Senior Engineering Technical Lead at Fairwinds, joins host Priyanka Raghavan to discuss zero-trust network reference architecture. The episode begins with high-level definitions of zero-trust architecture, zero-trust reference architecture, and the pillars of Zero Trust. Stevie describes four open-source implementations of the Zero Trust Reference Architecture: Emissary Ingress, Cert Manager, LinkerD, and the Policy Engine Polaris. Each component is explored to help clarify their roles in the Zero Trust journey. The episode concludes with a look at the future direction of Zero Trust Network Architecture.

This episode is sponsored by QA Wolf.



Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Priyanka Raghavan 00:00:51 Hi everyone, I’m Priyanka Raghavan for Software Engineering Radio, and today I’m chatting with Stevie Caldwell, a senior engineering tech lead at Fairwinds. She has a lot of experience in research and development, architecture, design audits, as well as client support and incident analysis. On top of this, Stevie has a wealth of knowledge in the areas of DevOps, Kubernetes, and Cloud infrastructure. Today we’re going to be talking about zero-trust network architecture, specifically diving deep into a reference architecture for Kubernetes. Welcome to the show, Stevie.

Stevie Caldwell 00:01:26 Thank you. Thank you for having me. It’s great to be here, and I’m psyched to talk to you today.

Priyanka Raghavan 00:01:30 So the first question I wanted to ask you is about trust and security, which are at the core of computing. In that regard, would you be able to define the term zero-trust network architecture for us?

Stevie Caldwell 00:01:43 Yeah, it’s often useful to define it in terms of what came before, or what might even still be standard now, which is a more perimeter-based approach to security. It’s also been called the castle approach; people have talked about castle-and-moat. Essentially, you’re setting up a perimeter of security that says anything outside my cluster or outside my network is to be looked upon with skepticism and is not to be trusted, but once you’re inside the network, you’re cool. You’re using the network itself as the identity. With zero-trust, by contrast, the motto is trust no one, like The X-Files. So you want to treat even things that are inside your perimeter, inside your network, with skepticism, with care. You want to remove that implicit trust and make it explicit, so that you’re being meaningful and deliberate about what things you allow to communicate with each other inside your network.

Stevie Caldwell 00:02:51 I like to use an analogy, one that I like a lot, which is an apartment building. You have an apartment building with a front door that faces the public, and people who live in that building are given a key, so they’re allowed to enter the building. But once they’re inside, you don’t just leave all the apartment doors open, right? You don’t just say, well, you’re in the building now, so you can go wherever you want. You still have security at each of the apartments, because those are locked. So I like to think of zero-trust as working the same way.

Priyanka Raghavan 00:03:26 That’s great. One of the books I was reading while preparing for the show was the Zero Trust Networks book. We had the authors of that book on the show about four years back, and they talked about some fundamental principles of zero-trust, pretty much similar to what you’re describing: the concept of trusting no one, depending a lot on segmentation, following the principle of least privilege, and then of course monitoring. Is that something you can elaborate on a little bit?

Stevie Caldwell 00:04:00 Yeah, so there is a framework around zero-trust, where there are these pillars that group the domains you would commonly want to secure in a zero-trust implementation. There’s identity, which deals with your users: who’s accessing your system, what are they allowed to access, even down to physical access from a user, like can you swipe into a data center? There’s applications and workloads, which deals with making sure that your applications and workloads are also vigilant about who they talk to. An example of this is workload security inside a Kubernetes cluster: making sure that only the applications that need access to a resource have that access, not letting everything write to an S3 bucket, for example. There’s network security, which honestly is where a lot of people focus when they start thinking about zero-trust: that’s micro-segmentation, that’s isolation.

Stevie Caldwell 00:05:01 That means isolating sensitive resources on the network, moving away from that perimeter-only approach to network security. There’s data security, so isolating your sensitive data, encryption in transit and at rest. There’s device security, which is about your devices, your laptops, your phones. And then across all of those are three additional pillars, which are cross-cutting. There’s the observability and monitoring piece, where you want to be able to see all of these things in action; you want to be able to log user access to something, or network traffic. There’s automation and orchestration, so that you’re taking some of the human-error element out of your zero-trust security solution. And then there’s a governance piece, where you want to have policies in place that people and systems follow, and ways of enforcing those policies as well.

Priyanka Raghavan 00:06:08 Okay, that’s great. The next question I wanted to ask you is about the term reference architecture; there seem to be multiple approaches. Could you explain the term, and then share your thoughts on these multiple approaches?

Stevie Caldwell 00:06:22 Yeah. So a reference architecture is a template, a way to draw out solutions to a particular problem. It makes it easier to implement your solution and provides a consistent solution across different domains, so you’re not reinventing the wheel, right? If an app team needs to do a thing and you have a reference architecture that’s already been built up, they can just look at that and implement what’s there, versus going out and starting from scratch. It’s funny because, well, I’ve said I’m a rock star and I’m not, obviously, but I do make music in my own time. And one of the things that’s important when you’re mixing a track is using a reference track, and it’s sort of the same idea. When I was reading about this, I thought, oh, this feels very familiar, because it’s the same idea: it’s something that someone else has already done that you can follow along with to implement your own thing, without having to start all over again. And they can be very detailed, or they can be high level; it really depends on the domain you’re trying to solve for. But at a minimum it should probably contain at least some information about what you’re solving and what the purpose of the design is, so people can more readily determine whether it’s useful to them or not.

Priyanka Raghavan 00:07:44 That’s great. And I think the other question I wanted to ask, which I think you alluded to in the first answer when I asked you about zero-trust network architecture, is why should we care about a zero-trust reference architecture in the Cloud, basically for Cloud native solutions? Why is this important?

Stevie Caldwell 00:08:03 I think it’s very much because in the Cloud you don’t have the same level of control that you have outside the Cloud, right? If you’re running your own data center, you control the hardware, the servers it runs on, you control the networking equipment to some degree, and you’re able to set up the access to the cage, to the data center. You just have more oversight and insight into what’s happening. But you don’t own the things in the Cloud. There’s more sprawl, there are no physical boundaries. Your workloads can be spread across multiple regions, multiple Clouds. It’s harder to know who’s accessing your apps and data and how they’re accessing them. And when you try to secure all these different aspects, you can often end up with a kind of hodgepodge of solutions that becomes really difficult to manage. And the more complex and difficult to manage your solutions are, the easier it is for them to not work, to not be configured correctly, and then expose you to risk. So you want a unified means of controlling access within the domain, and zero-trust is a good way to do that in a Cloud environment.

Priyanka Raghavan 00:09:22 I think that makes a lot of sense the way you’ve answered it: you’re running workloads on infrastructure that you have no control over, so it really makes sense to implement this zero-trust reference architecture. So, just to ask at a very high level before we dive deep: what are the main components of a zero-trust network architecture for Kubernetes? Is that something you can detail for us?

Stevie Caldwell 00:09:51 So for a Kubernetes cluster, I would say some of the main points you’d want to hit in a reference architecture would be ingress: how traffic is getting into your cluster, what’s allowed in, and where it’s allowed to go once it’s in the cluster, meaning what services your ingress is allowed to forward traffic to. Then there’s maintaining identity and security, so encryption and authenticating the identity of the parties taking part in your workload communication, using something like Cert Manager; there are certainly other solutions as well, but that is a piece that I feel should be addressed in your reference architecture. Then there’s the service mesh piece. That is what’s most commonly used for securing communications between workloads: for doing that encryption in transit, for verifying the identities of those components, and for defining which internal components can talk to each other. And then beyond that, which components can access which resources that might actually live outside your cluster: which components are allowed to access your RDS databases, your S3 buckets, which components are allowed to talk across your VPC to something else. It can get pretty large, which is why I think it’s important to split things up into domains. But for a Kubernetes cluster, I think those are your main things: ingress, workload communication, encryption, data security.

Priyanka Raghavan 00:11:27 Okay. I think that’s a good segue into the details. When we did that episode on zero-trust networks, one of the approaches the guest suggested was to figure out what your most important assets are and then work outwards, instead of the inward approach of first protecting the perimeter. Start with your assets and then go outwards, which I found very interesting when I was listening to that episode. I thought I’d ask for your thoughts on that before diving deep into the pillars we just discussed.

Stevie Caldwell 00:12:08 Yeah, I think that makes total sense. Starting with the most critical data and defining your attack surface helps you focus your efforts and not get overwhelmed trying to implement zero-trust everywhere at once, because that’s a recipe for complexity. And again, as we said, complexity can lead to misconfigured systems. So determine what your sensitive data is and what your critical applications are, and start there. I think that’s a good way to go about it.

Priyanka Raghavan 00:12:38 Okay. So I think we can now go into the different concepts. The book I was looking at was the Zero Trust Reference Architecture for Kubernetes, which you pointed me to, and it talks about four open-source projects: Emissary Ingress, LinkerD, Cert Manager, and Polaris. I thought we could start with the first part, Emissary Ingress, because we talked a lot about what comes into the network. But before I go into that, is there something we need to do in terms of the environment? Do we need to bootstrap it so that all of these different components trust each other in the zero-trust setup? Is there something that ties this all together?

Stevie Caldwell 00:13:26 If you’re installing these different components in your cluster, in general, if you install everything at once, the default, I think, is to allow everything. So there is no implicit deny in effect. You can install Emissary Ingress, set up your Host and your Mappings, and get traffic from the ingress to your services without having to set anything up. The thing that will determine that trust is going to be the service mesh, which is LinkerD in our reference architecture. And LinkerD, by default, will not deny traffic. So you can inject that sidecar proxy it uses, which I’m sure we’ll talk about later, into any workload, and it won’t cause any problems. It’s not deny-by-default, so you have to explicitly go in and start putting in the parameters that will restrict traffic.
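For readers following along, here is a minimal sketch of the opt-in Stevie describes: LinkerD’s sidecar is added per workload via an annotation, and until policies are defined, the mesh still allows all traffic. The workload name, namespace, and image below are hypothetical; the `linkerd.io/inject` annotation is the actual mechanism.

```yaml
# Hypothetical Deployment snippet: opting a workload into the Linkerd mesh.
# The linkerd.io/inject annotation asks Linkerd's proxy injector to add the
# sidecar; with no policies defined, traffic remains allowed by default.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app          # hypothetical workload
  namespace: demo       # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        linkerd.io/inject: enabled   # enables sidecar proxy injection
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.0.0   # hypothetical image
```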

Priyanka Raghavan 00:14:29 But I was wondering, in terms of each of these separate components, is there anything we need to do to bootstrap the environment before we start? Is there anything else we should keep track of? Or do we just install each of these components, which we’ll talk about, and then, how do they trust each other?

Stevie Caldwell 00:14:50 Well, they trust each other automatically, because that’s sort of the default in the Kubernetes cluster.

Priyanka Raghavan 00:14:55 Yeah. Okay.

Stevie Caldwell 00:14:55 Okay. So you install everything, and Kubernetes by default does not have a ton of security.

Priyanka Raghavan 00:15:03 Okay.

Stevie Caldwell 00:15:04 Right out of the box. So you install those things, they talk to each other.

Priyanka Raghavan 00:15:08 Okay. So then let’s deep dive into each of these components. What is Emissary Ingress, and how does it tie in with the zero-trust principles we just talked about, like monitoring the traffic coming into your network? How should one think about the perimeter, encryption, and things like that?

Stevie Caldwell 00:15:30 So I hope, if anyone from Emissary or from Ambassador hears this, that I do your product justice. Emissary Ingress, first of all, is an ingress. It’s an alternative to using the built-in Ingress objects that are already enabled in the Kubernetes API. And one of the cool things about Emissary is that it decouples the aspects of north-south routing, so you can lock down access to those things individually. That’s nice because when you don’t have those things decoupled, when it’s just one object that anyone in the cluster with access to the object can configure, it makes it pretty easy for someone to mistakenly expose something in a way they didn’t want to and introduce some sort of security issue or vulnerability. So in terms of what to think about with ingress, when you’re talking about the perimeter, I think the basics are determining what you want to do with encryption.

Stevie Caldwell 00:16:35 So, traffic comes into your cluster: are folks allowed to enter your cluster using unencrypted traffic, or do you want to force redirection to encryption? Is the request coming from a client? Do you have some sort of workload or service that clients need to authenticate against in order to be able to use it? And if it is coming from a client, figure out how to determine whether or not to accept it. You can use authentication to determine if that request is coming from an allowed source, and you can rate limit to help mitigate potential abuse. Another question you might want to ask is, are there requests that you just should not allow? Are there IPs or paths that you want to drop and not allow into the cluster at all? Or maybe they’re private, so they exist, but you don’t want people to be able to hit them. Those are the kinds of things you should think about when you’re configuring your perimeter, specifically via something like Emissary Ingress or some other ingress.

Priyanka Raghavan 00:17:39 Okay. I think the other thing is, how do you define host names and secure them? I’m assuming that as an attacker, this would be one thing they’re constantly looking for. So can you talk a little bit about how that’s done with Emissary Ingress?

Stevie Caldwell 00:17:53 So if I understand the question: with Emissary Ingress, there are a number of CRDs that get installed in your cluster that allow you to define the various pieces of Emissary Ingress, and one of those is a Host object. Within the Host object, you define the host names that Emissary is going to listen on, which will be accessible from outside your network. And I was talking about the decoupled nature: the Host is its own separate object, as opposed to Ingress, which puts the host in the Ingress object that sits alongside your actual workload in that namespace. So the Host object itself can be locked down using RBAC, so that only certain people can access it, can edit it, can configure it, which already creates a nice layer of security there, just being able to restrict who has the ability to change that object. Then your devs will create their Mapping resources that attach to that Host and allow that traffic to go back to the backend. And other than that, you should also create a TLS cert that you attach to your ingress, and that’s going to terminate TLS there. So that encryption piece is another way of securing your host, I guess.
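To make the decoupling concrete, here is a minimal sketch of the two Emissary Ingress CRDs Stevie describes: a Host, which a platform team can lock down with RBAC, and a Mapping, which an app team manages. All names, namespaces, and hostnames are hypothetical.

```yaml
# Hypothetical Host: owned by the platform team; defines the public hostname,
# terminates TLS, and forces plain HTTP to redirect to HTTPS.
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: public-host
  namespace: emissary            # hypothetical platform-owned namespace
spec:
  hostname: app.example.com      # hypothetical public hostname
  tlsSecret:
    name: app-example-com-tls    # TLS Secret, e.g. provisioned by Cert Manager
  requestPolicy:
    insecure:
      action: Redirect           # redirect unencrypted traffic to HTTPS
---
# Hypothetical Mapping: owned by the app team; attaches a path on that
# hostname to a backend Service.
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: my-app
  namespace: demo                # app team's namespace
spec:
  hostname: app.example.com      # binds to the Host above
  prefix: /api/                  # only this path is exposed
  service: my-app.demo:8080      # backend Service and port
```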

Priyanka Raghavan 00:19:27 Okay. I guess this is the part where, once you have the certificate, that takes care of your authentication bit as well, right? All the incoming requests?

Stevie Caldwell 00:19:38 Well, on the incoming requests to the cluster, no, because that’s the standard TLS stuff, where it’s just unidirectional, right? So unless the client has set up mutual TLS, which generally they do not, it’s just a matter of verifying the identity of the host itself to the client. The host doesn’t do any verification of the client there.

Priyanka Raghavan 00:19:59 Okay. So now that we’re talking about certificates, I think it’s a good time to talk about the other aspect, which is Cert Manager. This is used to manage the trust in our reference architecture. Can you talk a little bit about Cert Manager, with maybe some information on all the parties involved?

Stevie Caldwell 00:20:19 So Cert Manager is a solution that generates certificates for you. Cert Manager works with issuers that are external to your cluster, although you can also do self-signed, but you wouldn’t really want to do that in production. So it works with these external issuers and essentially handles the lifecycle of certificates in your cluster. Using shims, you can request certificates for your workloads and renew them. I think the default is that certificates are valid for 90 days, and then 30 days before they expire, Cert Manager will attempt to renew them for you. So that enables your standard north-south security via ingress. And then it can also be used in conjunction with LinkerD to help provide the glue for the east-west security with the LinkerD certs; I believe it’s used to provision the trust anchor itself that LinkerD uses for signing.
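As an illustration, a minimal sketch of a Cert Manager Certificate resource matching the lifecycle Stevie describes; the durations mirror the 90-day validity and 30-day renewal window mentioned above, and the names and issuer are hypothetical.

```yaml
# Hypothetical Certificate: Cert Manager keeps the referenced Secret populated
# with a valid cert/key pair and renews it automatically before expiry.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com
  namespace: demo
spec:
  secretName: app-example-com-tls   # Secret the signed cert and key are written to
  duration: 2160h                   # 90 days, the default validity mentioned above
  renewBefore: 720h                 # begin renewal 30 days before expiry
  dnsNames:
    - app.example.com               # hypothetical hostname
  issuerRef:
    name: letsencrypt-prod          # hypothetical issuer, defined separately
    kind: ClusterIssuer
```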

Priyanka Raghavan 00:21:28 Yeah, I think that makes sense. These days we need to secure the east-west as much as the north-south.

Stevie Caldwell 00:21:35 Yeah, that’s the purpose of the service mesh: that east-west TLS configuration.

Priyanka Raghavan 00:21:41 Okay. So you talked a little bit about the certificate lifecycle in Cert Manager, and that’s a massive pain for people who are managing certificates. Can you talk a little bit about how you automate trust? Is that something that’s also provided out of the box?

Stevie Caldwell 00:21:59 So Cert Manager does have another component, called Trust Manager. I’m not as familiar with that, but I think it comes into play specifically with being able to rotate the CA cert that LinkerD installs. So it’s getting a little bit into the LinkerD architecture, but at its core, LinkerD, when you install it, has its own internal CA, and you can essentially use Cert Manager and Trust Manager to manage that CA for you, so that you don’t have to manually create those key pairs and save them off somewhere. Cert Manager takes care of that for you. And when your CA is due to be rotated, Cert Manager, via Trust Manager, I think, takes care of that for you.

Priyanka Raghavan 00:22:56 Okay, I’ll add a note about that to the show notes so that listeners can dive deeper into it. But I also wanted to ask about these trusted authorities. Can you talk about those in the context of Cert Manager? Are there typical issuers that Cert Manager communicates with?

Stevie Caldwell 00:23:20 Yeah, there’s a long list, actually, that you can look at on the Cert Manager website. Some of the more common ones are Let’s Encrypt, which is an ACME issuer. People also use HashiCorp Vault, and I’ve also seen people use Cloudflare in their clusters.
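For example, a minimal sketch of an ACME ClusterIssuer pointing at Let’s Encrypt, one of the issuers Stevie names; the contact email, secret name, and solver ingress class below are hypothetical.

```yaml
# Hypothetical ClusterIssuer: tells Cert Manager how to request certificates
# from Let's Encrypt's production ACME endpoint.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-account-key       # Secret holding the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx                  # hypothetical HTTP-01 solver class
```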

Priyanka Raghavan 00:23:40 The next thing I want to ask about is that Cert Manager seems to have a lot of these third-party dependencies. Could this be an attack vector? Because I guess if Cert Manager goes down, then the trust is going to be severely affected, right? So how does one guard against that?

Stevie Caldwell 00:23:57 So yes, Cert Manager does rely on the issuers, right? That’s how it requests certificates and requests renewals; that’s part of that lifecycle management bit. So your ingress or service has some sort of annotation that Cert Manager knows about, and when it sees that pop up, it goes out, requests a certificate, and does the whole verification bit, whether that’s via a DNS record or via an HTTP well-known configuration file or something like that. And then it provisions that cert, creates a Secret with that cert data in it, and gives it to the workload. So the only time it really needs to go outside the cluster and talk to a third party is during that initial certificate creation and during renewal. I’ve actually seen situations where there’s been an issue with Let’s Encrypt.

Stevie Caldwell 00:24:58 It’s been very rare, but it has happened. But when you think about what Cert Manager is doing, it’s not constantly running and updating or anything like that. Once your workload gets a certificate, it has that certificate for 90 days. And like I said, there’s a 30-day window when Cert Manager tries to renew that cert. So unless you have some humongous issue where Let’s Encrypt is going to be down for 30 days, it’s not going to be a big deal. I don’t think there’s really a scenario of Cert Manager going down and then affecting the trust model. Similarly, when we get into talking about LinkerD and that east-west security, Cert Manager again really only manages the trust anchor. And the trust anchor is like a CA, so it’s more long-lived. And LinkerD actually takes care of issuing certificates for its own internal components without going off-cluster. It uses its internal CA, so that’s not going to be affected by any sort of third party being unavailable either. So I think there’s not much to worry about there.
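A minimal sketch of the annotation-driven flow Stevie describes: with Cert Manager’s ingress-shim, an annotation on an Ingress is the cue Cert Manager watches for; it then requests, verifies, and renews the certificate and writes it into the named Secret. This sketch uses the stock Kubernetes Ingress for simplicity; in the Emissary-based reference architecture you would more likely reference the Secret from a Host, as in the earlier sketch. Names are hypothetical.

```yaml
# Hypothetical Ingress using Cert Manager's ingress-shim: the annotation below
# triggers certificate issuance via the referenced ClusterIssuer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: demo
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # cue Cert Manager watches for
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls   # Cert Manager creates and renews this Secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 8080
```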

Priyanka Raghavan 00:26:09 Okay. Yeah, I was actually thinking more of that case from 2011 with the certificate authority DigiNotar, I might be getting the name wrong. It was a certificate-issuing company, and I think they had a breach, and then essentially all the certificates they had given out became invalid, right? So I was thinking of that worst-case scenario, because now Cert Manager is at the center of our zero-trust.

Stevie Caldwell 00:26:42 Yeah, but that’s not specific to Cert Manager. That applies to anything that uses any certificate authority.

Priyanka Raghavan 00:26:47 Okay. Now we can talk a little bit about LinkerD, which is the next open-source project, and that brings us to service meshes. We’ve done a bunch of shows on service meshes; listeners can take a look at Episode 600. But the question I want to ask you is, how is LinkerD different from the other service meshes that are out there?

Stevie Caldwell 00:27:21 I think one of the main differences that LinkerD likes to point out is that it’s written in Rust and that it uses its own custom-built proxy, not Envoy, which is a standard that you’ll find in a lot of ingress solutions. The LinkerD folks will tell you that’s part of what makes it so fast. Also, it’s super simple in its configuration and does a lot of stuff out of the box that enables you to just get going with at least basic configurations like mutual TLS. So yeah, I think that’s probably the biggest difference.

Priyanka Raghavan 00:27:58 Okay. And we talked a little bit about checking access every time in zero-trust. How does that work with LinkerD? I think you mentioned the east-west traffic being secured by mutual TLS. Can you talk a little bit about that?

Stevie Caldwell 00:28:11 Yeah, so when we talk about checking every access every time, it’s essentially tied into identity. The Kubernetes service accounts are the base identity used behind these certificates. So the LinkerD proxy, which is a sidecar that runs alongside the containers in your pod, is responsible for requesting the certificate, verifying the certificate’s data, and verifying the identity of the workload, submitting the certificate request to the identity issuer, which is another component that LinkerD installs inside your cluster. So when you’re doing mutual TLS, it’s not only encrypting the traffic; it’s also constantly using the CA that it creates to verify that the entity on the certificate really has permission to use that certificate.

Priyanka Raghavan 00:29:13 That really ties the trust angle into this access pattern. While we’re on access patterns, I also want to come back to something you said earlier, that usually in Kubernetes most services are allowed to talk to each other. So what happens with LinkerD? Is there a possibility of having a default deny? Is that available in the configuration?

Stevie Caldwell 00:29:41 Yes, absolutely. I believe you can annotate a namespace with a deny policy, and that will deny all traffic. And then you’ll have to go in and explicitly say who’s allowed to talk to whom.
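A minimal sketch of that default deny, assuming LinkerD’s `config.linkerd.io/default-inbound-policy` annotation (the namespace name is hypothetical): meshed pods in the namespace then refuse inbound traffic unless a policy explicitly authorizes it.

```yaml
# Hypothetical namespace with Linkerd's inbound policy flipped to deny:
# proxies injected into pods here reject any traffic no policy allows.
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  annotations:
    config.linkerd.io/default-inbound-policy: deny
```

Note that the annotation is read at proxy-injection time, so already-running pods typically need a restart to pick it up.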

Priyanka Raghavan 00:30:00 Okay. So that follows our principle of least privilege. But I’m assuming it’s then possible to add a level of permissions, or some sort of RBAC, on top of that? Is that something that…

Stevie Caldwell 00:30:13 Yeah. I can’t remember the exact name of the object; it’s something like an mTLS authentication policy. I think there are three pieces that go along with it. There’s a Server piece that identifies the server you want to access. There’s an mTLS authentication object that then maps who’s allowed to talk to that server and the ports they’re allowed to talk on. So there are other components you can deploy to your cluster in order to start controlling traffic between workloads and restrict access based on the service or port something is trying to talk to. You can also restrict the path, I think, so you can say service A can talk to service B, but only on a specific path and a specific port. So you can get very granular with it, I believe.
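The exact names, for reference, are the Server, MeshTLSAuthentication, and AuthorizationPolicy CRDs. Here is a minimal sketch of the trio, allowing only a hypothetical service-a to reach service-b on one port over mutual TLS; all names and namespaces are hypothetical.

```yaml
# Hypothetical Server: selects the pods and port being protected.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: service-b-http
  namespace: demo
spec:
  podSelector:
    matchLabels:
      app: service-b        # hypothetical server workload
  port: 8080                # only this port is covered
---
# Hypothetical MeshTLSAuthentication: which caller identities are accepted,
# backed by their mTLS certificates (i.e., their service accounts).
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: service-a-only
  namespace: demo
spec:
  identityRefs:
    - kind: ServiceAccount
      name: service-a
---
# Hypothetical AuthorizationPolicy: ties the authentication to the Server.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow-service-a
  namespace: demo
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: service-b-http
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: service-a-only
```

For the path-level restriction Stevie mentions, I believe the targetRef can point at an HTTPRoute instead of a whole Server, giving per-route authorization.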

Priyanka Raghavan 00:31:07 Okay. So that really brings in the concept of least privilege with LinkerD, right? Because you can specify the path, the port, and, like you said, who’s allowed to talk to it, plus the authentication, because there’s a default deny. And I guess the other question is, what if something bad happens in one of the namespaces? Is it possible to lock something down?

Stevie Caldwell 00:31:34 Yeah. I think that’s the default deny policy that you can apply to a namespace.

Priyanka Raghavan 00:31:39 Okay. So when you’re monitoring and you see something’s not going well, you can actually go and configure LinkerD to deny.

Stevie Caldwell 00:31:48 Yes. Depending on how much of a panic you’re in, you can just go ahead and say nothing can talk to anything in this namespace, and that will solve it: nothing will be able to talk to it. Or you can go in and change one of those objects I was talking about earlier: the Server, the MeshTLS authentication, which is the other one I was trying to remember, and the authorization policy. Those three go together to put fine-grained access permissions between workloads. So you can change those, or you can just shut off the lights and apply the annotation to a namespace pretty quickly.

Priyanka Raghavan 00:32:28 Okay. I wanted to talk a little bit about identities as well. What are the different types of identities you would see in a reference architecture? I guess for north-south you’ll see user identities; what other kinds can you talk about?

Stevie Caldwell 00:32:39 Yeah. I mean, it depends on what you have in your environment. So again, what you need to provision, the sort of reference architecture you need to create, and the policies you need to create really depend on what your environment is like. If you have devices, devices can be part of that; how they’re allowed to access your network, I feel, is a component of identity. But I think in general we’re talking specifically about, like you said, users, and we’re talking about workloads. When we talk about users, we’re talking about controlling those with RBAC and using, I don’t want to say a third party, but an external authentication service along with that. So IAM is a very common way to authenticate users to your environment, and then you use RBAC to do the authorization piece: what are they allowed to do?

Stevie Caldwell 00:33:40 That’s one level of identity, and it ties into workload identity, so that’s another factor. And that is what it sounds like: it’s essentially your workloads taking on a persona. They have an identity that also has the ability to be authenticated outside the cluster, using IAM again, and then also having RBAC policies that control what those workloads can do. So one of the things I mentioned earlier is that because of the decoupled nature of Emissary, your ingress isn’t just one object that sits in the same namespace as your workload, where potentially your developers have complete access to configuring it however they want, creating whatever path they want, going to whatever service. You can imagine, if you have some sort of breach and something is in your network, it could alter an ingress and say, okay, everything in here is all open, or create some opening for itself. With the way Emissary does it, there’s a separate Host object, so the Host object can sit somewhere else.

Stevie Caldwell 00:34:54 And then we can use parts of that identity piece to protect that Host object and say that only people who belong to this group, the systems operator group or whatever, have access to that namespace, or within that namespace only this group has the ability to edit that Host configuration. Or, what we most likely do is take that out of the realm of being about specific people and roles, tie it into our CICD environment, and make it a non-human identity that controls those things.
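A minimal RBAC sketch of that idea: only a hypothetical operators group may modify Host objects in the namespace where they live. The group, namespace, and names are all hypothetical.

```yaml
# Hypothetical Role: permission to manage Emissary Host objects.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: emissary-host-editor
  namespace: emissary                 # hypothetical namespace holding Hosts
rules:
  - apiGroups: ["getambassador.io"]
    resources: ["hosts"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Hypothetical RoleBinding: grants the Role to an operators group. It could
# equally bind a CI/CD service account, the non-human identity mentioned above.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: emissary-host-editors
  namespace: emissary
subjects:
  - kind: Group
    name: systems-operators           # hypothetical group from the external IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: emissary-host-editor
  apiGroup: rbac.authorization.k8s.io
```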

Priyanka Raghavan 00:35:33 So there are multiple identities that come into play. There’s the user identity, there’s workload identity, and apart from that, you have the authentication service that you can apply on the host. Then you can also configure authorization and certain rules, and of course you’ve got all your ingress controls at the network layer as well. So it’s a very layered approach: you can apply identity at a lot of levels, and that ties in well with least privilege. I think that answers my question, and hopefully for the listeners as well.

Stevie Caldwell 00:36:11 Yeah. That’s what we call defense in depth.

Priyanka Raghavan 00:36:14 So I think now would be a good time to talk about policy enforcement, which we discussed as one of the tenets of zero-trust networks. I think there are the NSA Hardening Guidelines for Kubernetes, and if I look at those, it’s huge. It’s a lot of stuff to do.

Stevie Caldwell 00:36:32 Yes.

Priyanka Raghavan 00:36:37 So how do teams implement things like that?

Stevie Caldwell 00:36:49 Yes, I get it.

Priyanka Raghavan 00:36:52 It’s huge, but I was wondering if the whole concept of these, of Polaris and open- source projects that came out of the fact that this would be an easy way, like a cookbook to implement some of those guidelines?

Stevie Caldwell 00:37:07 Yeah. The NSA Hardening Guidelines are great, and they are super detailed, and they outline a lot of this. This is my strong subject, since this is about Polaris. Well, we hadn’t actually said the name yet.

Priyanka Raghavan 00:37:24 Yeah, Polaris.

Stevie Caldwell 00:37:25 But Polaris, which we’re going to talk about in relation to policy, is a Fairwinds project. And yeah, those Hardening Guidelines are super detailed, very useful. A lot of the guidelines are things we at Fairwinds have followed since before this even became a thing, like setting CPU requests and limits. In terms of how teams implement that, it’s hard because there’s a lot of material there. Teams would normally have to manually check for these things across all their workloads and systems, then figure out how to configure them, and test to make sure it’s not going to break everything. And then it’s not a one-time thing. It has to be an ongoing process, because every new application, every new workload that you deploy to your cluster has the ability to violate one of those best practices.

Stevie Caldwell 00:38:27 Doing all that manually is a real pain. And I think oftentimes what you see is that teams will go in with the intention of implementing these guidelines and hardening their systems. It takes a long time to do, and by the time they get to the end, they think, okay, we’re done. But by that time, a bunch of other workloads have been deployed to the cluster, and they rarely go back and start all over again. They rarely repeat the cycle. So implementing that is difficult without some help.

Priyanka Raghavan 00:39:04 Okay. So let’s talk about Polaris, the open-source policy engine from Fairwinds. What is it, and why should one choose Polaris when there are a lot of other policy engines out there, like OPA and Kyverno? Maybe you could just break it down for someone like me.

Stevie Caldwell 00:39:24 So Polaris is an open-source policy engine, like I said, developed by Fairwinds, and it comes with a bunch of pre-defined policies that are based on those NSA guidelines, plus you have the ability to create your own. And it is a tool; I’m not going to say it’s the only tool, right? Because as you mentioned, there are other policy engines out there. But it is a tool you could use when you ask how teams implement those guidelines. It’s a good way to do that, because it’s sort of a three-tiered approach. You run it manually to determine what things are in violation of the policies that you want; there’s a CLI component that you can run, or a dashboard that you can look at.

Stevie Caldwell 00:40:15 You fix all those things up, and then, in order to maintain adherence to those guidelines, you can run Polaris in your CICD pipeline, so that it shifts left and blocks anything that would violate one of those guidelines from getting into your cluster in the first place. And you can run it as an admission controller, so it will reject, or at least warn about, any workloads or objects in your cluster that violate those guidelines as well. So when we talk about how teams implement those guidelines, using something like that, a policy engine, is the way to go. Now, why Polaris over OPA or Kyverno? I mean, I’m biased, obviously, but I think the pre-configured policies that Polaris comes with are a really big deal, because there’s a lot of stuff that just makes sense right out of the box and, again, is best practice, because it’s based on that NSA hardening document. So it can make it easier and faster to get up and running with some basics. And then you can write your own policies, and those policies can be written using JSON Schema, which is much easier to rock, in my opinion, than OPA, because there you’re writing Rego policies, and Rego policies can be a little difficult to get right.

Priyanka Raghavan 00:41:46 And there’s also this other concept, which you call BYOC, Bring Your Own Checks. Can you talk a little bit about that?

Stevie Caldwell 00:41:55 Yeah, so that’s about the fact that you can write your own policies. For example, in the context of the zero-trust reference architecture that we’ve been alluding to during this talk, there are objects that are not natively part of a Kubernetes cluster, and the checks that we have in place don’t take those into consideration, right? It would be impossible to write checks against every possible CRD that’s out there. So one of the things you might want to do, for example, if you’re using LinkerD, is check that every workload in your cluster is part of the service mesh; you don’t want something sitting outside of it. So you can write a policy in Polaris that checks for the existence of the annotation that’s used to add a workload to the service mesh. You can check to make sure that every workload has a Server object, along with the mTLS authentication policy object, et cetera. So you can tweak Polaris to check very specific things that aren’t part of the Kubernetes native API, which I think is super helpful.
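A sketch of what such a bring-your-own check might look like in Polaris’s custom-check format. The check name is hypothetical, and the target and schema path (matching a Deployment’s pod-template annotations) are assumptions about how you would express this particular rule.

```yaml
# Hypothetical Polaris custom check: flag workloads whose pod template lacks
# the linkerd.io/inject annotation, i.e. anything sitting outside the mesh.
checks:
  linkerdInjectEnabled: danger        # hypothetical ID, wired to the check below
customChecks:
  linkerdInjectEnabled:
    successMessage: Workload is part of the Linkerd service mesh
    failureMessage: Workload is missing the linkerd.io/inject annotation
    category: Security
    target: Controller                # assumed: validate the whole controller object
    schema:
      '$schema': http://json-schema.org/draft-07/schema
      type: object
      required: ["spec"]
      properties:
        spec:
          type: object
          required: ["template"]
          properties:
            template:
              type: object
              required: ["metadata"]
              properties:
                metadata:
                  type: object
                  required: ["annotations"]
                  properties:
                    annotations:
                      type: object
                      required: ["linkerd.io/inject"]
                      properties:
                        "linkerd.io/inject":
                          const: enabled
```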

Priyanka Raghavan 00:43:12 Okay. So you’re able to point out policy violations, but is there a way that any of these agents can also fix issues?

Stevie Caldwell 00:43:21 No, not at the moment. It’s not reactive in that way. It will print out the issue; it can print to standard out if you’re running the CLI, obviously the dashboard will show you, and if you’re running the admission controller, when it rejects your workload, it will print that out and send that back as well. It just reports on it. It’s non-intrusive.

Priyanka Raghavan 00:43:46 Okay. You talked a little bit about this dashboard for viewing violations. Does that come out of the box? If you install Polaris, you also get the dashboard?

Stevie Caldwell 00:43:58 Mm-Hmm, that’s correct.

Priyanka Raghavan 00:43:59 Okay. So that, I guess, gives you an overview of all the passing checks and the violations and things like that.

Stevie Caldwell 00:44:08 Yeah, it breaks it down by namespace, and within each namespace it’ll show you the workload, and under each workload it’ll show you which policies have been violated. You can also set the severity of these policies, so that helps control whether a violation means you can’t deploy to the cluster at all, or whether it just gives you a heads-up. So it doesn’t have to be all-or-nothing blocking or anything like that.
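For instance, a fragment of Polaris configuration tuning built-in checks between blocking and advisory severities; the check names here are real built-ins, while the particular severity choices are illustrative.

```yaml
# Hypothetical severity tuning: "danger" findings can cause the admission
# controller to reject a workload, while "warning" findings only report.
checks:
  runAsRootAllowed: danger           # block containers running as root
  privilegeEscalationAllowed: danger # block privilege escalation
  cpuRequestsMissing: warning        # heads-up only: CPU requests not set
  memoryRequestsMissing: warning     # heads-up only: memory requests not set
```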

Priyanka Raghavan 00:44:35 So I think we’ve covered a good bit about Polaris, and I’d like to wrap up the show with a couple of other questions. One is, are there any challenges that you have seen with real teams, real examples, in implementing this reference architecture?

Stevie Caldwell 00:44:54 I think in general it’s just the human element of being frustrated by restrictions, especially if you’re not used to them. So you have to really get buy-in from your teams, and you also have to balance what works for them in terms of their velocity with keeping your environment secure. You don’t want to come in, throw in a bunch of policies all of a sudden, and just say, there you go, because that’s going to cause friction, and then people will always look for ways around the policies you put in place. The communication piece is super important, because you don’t want to slow down velocity and progress for your dev teams by putting a lot of roadblocks in their way.

Priyanka Raghavan 00:45:40 Okay. And what’s the future of zero-trust? What new areas of development do you see in this reference architecture space for Kubernetes?

Stevie Caldwell 00:45:51 I mean, I really just see continuing adoption and deeper integration across the existing pillars, right? We’ve identified these pillars, and I was talking about how you can implement something in your cluster and then think, yay, I’m done. But generally there’s a path; in fact, there’s a maturity model, I think, that has been released that talks about each level of maturity across all those pillars. So I think it’s just helping people move up that maturity model, and that means integrating zero-trust more deeply into each of those pillars, using things like the automation piece and the observability and analytics piece. I think that’s really going to be where the focus is going forward: how to progress from the standard security implementation to the advanced one.

Priyanka Raghavan 00:46:51 Okay. So more adoption, rather than new things, and moving up the maturity model. Okay.

Stevie Caldwell 00:46:57 Exactly.

Priyanka Raghavan 00:46:59 And what about the piece on automatic fixing and self-healing? What do you think about that? For example, with the policy violations you talked about, the tool prints them out, but what do you think about automatic fixing? Is that something that should be done? Or could it actually make things go bad?

Stevie Caldwell 00:47:21 It could go either way, but I think in general there’s a push towards having some self-healing components, just like Kubernetes itself, right? So, going back to resources: if your policy is that every workload has to have CPU and memory requests and limits set, do you reject the workload because it doesn’t have them and send a message back to the developer saying you need to put that in there? Or do you have a default that says, if that’s missing, just put it in there? I think it depends. Self-healing in that respect can be great, depending on what it is you’re healing, what the policy is. Maybe not with resources, I think, because resources are so variable, and there’s no way to really have a good baseline default resource template across all workloads, right? But you could have a default like setting the user to non-root, right? Or, gosh, any number of other things. Take the LinkerD inject annotation: if a workload doesn’t have it, instead of rejecting it, just go ahead and put it in there. Things like that I think are absolutely great, and I think those would be great additions to have.

Priyanka Raghavan 00:48:55 Okay. Thank you for this and thanks for coming on the show, Stevie. What is the best way people can reach you on the cyberspace?

Stevie Caldwell 00:49:05 Oh, I’m on LinkedIn. I think it’s just Stevie Caldwell. There are actually a lot of us, but you’ll know me. Yeah, that’s pretty much the best way.

Priyanka Raghavan 00:49:15 Okay, so I’ll find you on LinkedIn and add it to the show notes. I just wanted to thank you for coming on the show and demystifying zero-trust network reference architecture. So thanks for this.

Stevie Caldwell 00:49:28 You’re welcome. Thank you for having me. It’s been a pleasure.

Priyanka Raghavan 00:49:31 This is Priyanka Raghavan for Software Engineering Radio. Thanks for listening.

[End of Audio]
