SE Radio 619: James Strong on Kubernetes Networking

Infrastructure engineer and Kubernetes Ingress-NGINX maintainer James Strong joins host Robert Blumen to discuss the Kubernetes networking layer. The discussion draws on content from Strong’s book on the topic and covers a lot of ground, including: the Kubernetes network’s use of different IP ranges than the host network; an overlay network with its own IP ranges compared to using expanded portions of the host network ranges; adding routes with kernel extension points; programming kernel extension points with iptables compared to eBPF; how routes are updated as the host network gains or loses nodes; the use of the Linux network namespace to isolate each pod; routing between pods on the same host; routing between pods across the host network; the container network interface (CNI); the CNI ecosystem; differences between CNIs; choosing a CNI when running on a public cloud service; the Kubernetes service abstraction with a cluster-wide IP address; monitoring and telemetry of the Kubernetes network; and troubleshooting the Kubernetes network. Brought to you by IEEE Software magazine and IEEE Computer Society.

Show Notes

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Robert Blumen 00:00:19 For Software Engineering Radio, this is Robert Blumen. I have with me today James Strong. James is a Cloud native and infrastructure engineer. He is a maintainer on the Kubernetes Ingress-NGINX project and is currently a Senior Solutions Architect at Isovalent. James is the co-author with Vallery Lancey of the book Networking and Kubernetes: A Layered Approach published by O’Reilly. And that will be the subject of our conversation today. James, welcome to Software Engineering Radio.

James Strong 00:00:55 I am super excited to be here, Robert. I love talking about networking. I love going as deep as possible. It’s also interesting to talk about networking without diagrams, so we’ll see how well we can describe things together.

Robert Blumen 00:01:09 Yeah, that is a common challenge that we face on Software Engineering Radio. Unless there’s anything else you’d like to say about your background, we can dive right in.

James Strong 00:01:19 That’s a pretty good wrap-up. We’re going to talk a lot about software engineering, and a lot of the time these days I’m talking to network engineers who are just getting into Kubernetes and the Cloud native space. A lot of the questions they have, I had the same ones. When I started, I thought I wanted to be a network engineer. I was on the path to try to get a CCIE and the Cloud just came around. I was like, this is very interesting. I went around working with that for a little while and then Kubernetes popped up as well to try to be this operating system for the data center. All those conversations, like the abstraction, all of that. And I just slowly migrated to doing the Kubernetes piece just because it was infrastructure, it was networking, it was software. It was a good cross section of all of those things. So yeah, that was a real quick overview: I’ve been through that transition and hopefully I can have an intelligent conversation about that. And I answered a lot of questions last week at KubeCon about how to go from being a network engineer to working on Kubernetes networking.

Robert Blumen 00:02:24 Your career path is relevant to our audience and our podcast. Software Engineering Radio has been around for about 20 years. When it started, we really focused a lot on the writing code aspect of software engineering. But over time, the topics that we talk about have broadened to include many things in networking and security, which are now considered normal for a software engineer to know about and use in their job. I think you have a great background to talk about this topic.

James Strong 00:02:56 I think everyone should understand how the network works. Now there’s a saying in networking: it’s guilty till proven innocent. So if we can skip that step and just have a conversation about performance issues or other things like that, that’d be great.

Robert Blumen 00:03:09 We’ll be talking about the Kubernetes network. We have some existing content on Kubernetes, in particular 445 on eBPF and 586 on VPCs and a bunch of other Kubernetes topics in the archive. But nothing specifically about the network. To introduce this, I’m going to say that when I started looking at Kubernetes, it differed from other distributed systems. In many distributed systems you have some hosts, you have a network, you run the service on the hosts, and they use the host network. That is not how Kubernetes works. What is the big difference?

James Strong 00:03:49 I would like to challenge that assumption because I hear that a lot, again, having this conversation with folks who are just getting into Kubernetes. One of the big things that I’ve learned while going through this, troubleshooting, helping stand up Kubernetes clusters and migrating applications to Kubernetes, is that it’s still Linux networking. If you have a Linux networking background, you understand how interfaces work, how IP addresses get assigned, how routing works. You just need to learn how Kubernetes talks about it from that perspective. So you’re still setting up network interfaces. There are abstractions and software that do it for you instead of having someone else do it. So you’re not putting in a ticket to get an IP address and get a sysadmin to set up a network interface. There’s software that does it for you. And one of those abstractions, and I’m sure we’ll talk about this more, is the container network interface. So Kubernetes relies on the routing of the underlying hosts as well as the software, the container network interface, to set up the networking for us. Kubernetes doesn’t know about the underlying networking from that perspective. So it expects that nodes can talk to each other. So there are some underlying expectations and assumptions and then there’s also just, again, it is just Linux networking from that perspective.

Robert Blumen 00:05:12 Let’s talk at a high level. I started out with my initial premise that Kubernetes is different from a standard distributed system, and you’re pushing back on that. So maybe bear with me a bit: I’m wrong, but why did I think that? Explain why I’m wrong, but also whether there is something about that that is at least partially true. Does Kubernetes networking not have more layers to it than the flat distributed system that I described?

James Strong 00:05:45 I would say that I think we’re both right: it is still just Linux, but there are layers of abstraction on it. So when Kubernetes starts a pod and says, I need a pod, there are several things that happen, and we can walk through that and maybe go specifically into the networking side of that. So the lowest common denominator for a container or for an application is a pod. And a pod is a collection of containers. So you can have multiple containers running in there besides just your specific application. Well, when that starts up, it needs a control group and a namespace. Control groups are controlling for the resources of the container. So you’re looking at like CPU limits and things like that, memory limits, that’s all implemented in the control groups. And then namespaces allow us to have that abstraction where a pod looks like it is inside of its own, I hate to say it, virtual space.

James Strong 00:06:46 So you’ll have things like a network namespace that separates the network stack of the host from that of the pod running on it. This is why you can have a pod running on port 80 inside its own network namespace and it won’t conflict with other pods running on port 80 as well. Because normally on a host, the network can only have one process running on port 80. So the network namespace is what allows us to do that. There are other namespaces as well. The process ID namespace, the user ID, the User namespace, there’s a couple more, I think there’s six or seven, I may be wrong with that number. But the one we’re specifically concerned about is the network namespace. And that’s what allows us to have pods running on specific ports that don’t conflict. Those also need to get wired up to the traditional physical network that we think about, or just the traditional network. I mean, you’ve talked about VPCs; those also aren’t physical networks. Those are virtual networks running on top of physical networks. So there’s lots of layers of abstraction.
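The port isolation described here can be illustrated with a small toy model. This is only a sketch in plain Python, not real namespace code: the `NetworkNamespace` class, the pod names, and the process names are all hypothetical, and a real network namespace is a kernel object rather than a dictionary.

```python
# Toy model: each network namespace keeps its own table of bound ports,
# so the same port number can be bound once per namespace.
# All names here are hypothetical and exist only for illustration.
class NetworkNamespace:
    def __init__(self, name):
        self.name = name
        self.bound_ports = {}  # port number -> process name

    def bind(self, port, process):
        # Mirrors the "address already in use" failure within one namespace.
        if port in self.bound_ports:
            raise OSError(f"port {port} already in use in {self.name}")
        self.bound_ports[port] = process


def try_bind(namespace, port, process):
    """Return True if the bind succeeded, False on a port conflict."""
    try:
        namespace.bind(port, process)
        return True
    except OSError:
        return False


pod_a = NetworkNamespace("pod-a")
pod_b = NetworkNamespace("pod-b")

first_in_a = try_bind(pod_a, 80, "nginx")            # succeeds
first_in_b = try_bind(pod_b, 80, "httpd")            # succeeds: separate namespace
second_in_a = try_bind(pod_a, 80, "another-server")  # fails: same namespace
```

The point of the model is only that the conflict check is scoped to one namespace, which is why two pods on the same host can both listen on port 80.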

Robert Blumen 00:07:50 Alright, you’ve introduced a bunch of things, and I want to cover all of them in more detail. You were going into namespaces. That is something I want to come back to. I want to really understand what it does and why, and then go more into how. I think one of the key points you made is that when the pod boots up and it’s in its own namespace, it can connect to port 80 or 443 or whatever its preferred ports are. Why is that important in the Kubernetes world?

James Strong 00:08:28 Well, like we’ve talked about, it’s a distributed system. So you could be running thousands of pods that are running on port 80, 443, 8080. So in order to be able to have that density, you’ll have four or five pods possibly running on that port. So Kubernetes does that for you from that abstraction and it does that translation for you. So it manages the mapping. So the pod thinks it’s running on port 80, but on the host network it could be running on 32368.

Robert Blumen 00:09:01 Okay. So that’s important because one of the key value adds of Kubernetes is it enables efficient use of your compute resources. You could buy or lease whatever VMs based on the overall value of that VM and then you could put many different pods on that VM so you could efficiently use all the compute that you’re paying for versus otherwise you’d be faced with a decision of either I’m going to have one pod per VM or I’m only going to have one service that can use port 80 on a given VM. This approach, it overcomes that limit and allows efficient packing. Is that more or less going in the right direction?

James Strong 00:09:48 Yeah, that is one of the use cases as well.

Robert Blumen 00:09:51 So let’s go into this: there are at least two ways Kubernetes can work. It can share the same IP ranges as the host network, or it can have an overlay with a completely separate IP range. Could you talk about how those two work and what’s the difference between them?

James Strong 00:10:13 Yeah, so you can have, I think Cilium calls it a direct routing mode, and the VPC CNI in AWS runs very similar to what you’re talking about: the pod CIDR range, so that’s the IP address ranges that get assigned to pods running on the network, is routed the same way that the host network is. So everybody knows how to reach those pods without having to do any other layer of abstraction. With something like VXLAN, that gets encapsulated and it gets shared between the hosts over that encapsulation tunnel, so that VXLAN tunnel. So it’s another layer of abstraction from that perspective. You’ll get a little bit of a performance hit because you are doing extra packet encapsulation and sharing that over UDP between hosts. So it all depends again on what you need from that perspective. So if you have IP address issues, like the address space, VXLAN could be helpful from that perspective. But if you could, I would use direct routing from that perspective. There’s one less layer to worry about.
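The encapsulation cost mentioned here can be made concrete with a little arithmetic. The sketch below assumes VXLAN over an IPv4 underlay with no VLAN tags or IP options, which gives the commonly quoted 50 bytes of overhead; the function and constant names are illustrative and not taken from any CNI.

```python
# Header sizes in bytes for VXLAN over an IPv4 underlay
# (assumes no VLAN tags and no IP options).
OUTER_IPV4_HEADER = 20
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14  # the pod's frame is carried whole inside the tunnel


def vxlan_overhead():
    """Bytes of the underlay MTU consumed by encapsulation."""
    return (OUTER_IPV4_HEADER + OUTER_UDP_HEADER
            + VXLAN_HEADER + INNER_ETHERNET_HEADER)


def pod_mtu(underlay_mtu=1500):
    """Largest IP packet a pod can send without fragmenting the tunnel."""
    return underlay_mtu - vxlan_overhead()


overhead = vxlan_overhead()
standard = pod_mtu(1500)
jumbo = pod_mtu(9001)
```

With a standard 1500-byte underlay MTU, the pod interface would be limited to 1450 bytes under these assumptions, which is one way the extra layer shows up in practice. Direct routing avoids this reduction because nothing is wrapped.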

Robert Blumen 00:11:32 The choice of those two would be determined by something you just mentioned, which is the CNI and Cilium being an example. Say more about what the CNI is.

James Strong 00:11:43 So the CNI is a project that helps abstract all of the networking stuff that we’re talking about. So adding, deleting, updating the host networks, the IP addresses, all of that. So when we think about Kubernetes as a whole, Kubernetes is made up of a collection of software that works together and even the container runtime is another piece of software. So the container runtime interface has a specification. The Container Network Interface has a specification that projects implement. Cilium is one of them, there’s others out there. Flannel, AWS, all the major Cloud providers have their own CNI for working on their Cloud networks. Even storage has a storage interface. So all of these collections of software talk to each other to enable the distributed system that is Kubernetes. So when a pod comes up, there’s another piece of software that’s running on there, that’s the kubelet. The kubelet’s going to say, hey, I need to run this pod. And it does that by having a conversation with the CNI. And so you can have a CNI installed that will do specific things for you and Cilium is one of those.

Robert Blumen 00:13:02 And Kubernetes, it’s really a control plane and when it needs to do something, it’s pretty agnostic about how it’s going to happen. It says I need storage, I need networking, I need a container. And then it delegates to one of these interfaces that knows how to do that on the specific infrastructure where it’s running. Is that more or less what you said?

James Strong 00:13:26 That is a much more succinct way of saying what I just said. Yeah.

Robert Blumen 00:13:30 Okay, great. Now if I’m running a managed Kubernetes on one of the major Cloud providers, I’m going to assume that the Cloud provider is opinionated that I should use their recommended or their branded CNI that works with their networking, versus if I’m standing up a Kubernetes cluster that I’m going to manage, then I have to choose the CNI. Is that so far correct?

James Strong 00:13:58 Yes, but it is also changing. So again, I have the most experience with Cilium. So Cilium is an approved CNI that can run on those Clouds. So Cilium has partnerships with AWS, Azure, and GCP to run those. And actually, I think GKE Dataplane V2 is open-source Cilium by default. So there are use cases where you can just use the Cilium CLI or the Cilium project instead of the default, like the VPC CNI on AWS.

Robert Blumen 00:14:35 We do need to disclose: Cilium is open source. And do you have any employment relationship with Cilium? We need to just disclose that.

James Strong 00:14:45 Yeah, no worries. So Cilium is an open-source project. It is in the CNCF and we do have an enterprise version that customers can pay for. So they get a plethora of other features outside of just the Cilium open-source and they get folks like me to help them with their implementations and support.

Robert Blumen 00:15:05 Great. Now in standing up a Kubernetes cluster, if there is more than one CNI that could work in that environment, what are some considerations that go into selecting the one that you will run?

James Strong 00:15:19 It’s always around the use case. So in that timeframe when we were talking about it, I also used to be a consultant. So really it’s what’s the use case? What do you need to do? So there are certain features and functionality that are available in CNIs. One of the examples we always talk about is network policies. So another thing to discuss and talk about, all pods can communicate with other pods. That’s one of the pieces of the networking philosophy in Kubernetes. The problem with that is a security aspect. So there is no default network policy that’s running in there. So if you want to restrict who can talk to who on the network in a Kubernetes cluster, you have to use a CNI that supports network policies. So that’s one of the base things you should look at from that perspective. And as I was saying, Cilium supports that. There are others: Calico, the AWS VPC CNI, and I think you can do the security groups at the pod level as well. So things like that, but that’s one of the ones you can look at. So there are other advanced features and functionality. So if you want to be able to support a service mesh, some of the CNIs can support a service mesh from that perspective. So it really just depends on what your use cases are and the functionality that you desire.

Robert Blumen 00:16:42 We had an earlier episode, I’ll put in the show notes, about service mesh that can be implemented in a sidecar that runs as a container inside a pod. It sounds like you have some options as far as what layer you want that functionality to exist at. And maybe it started out as its own pod, but if you know that you want to do it when you set up the cluster, it may have a better home in the networking layer.

James Strong 00:17:12 Yeah, so Istio does use a sidecar proxy with Envoy. I think they have a new mode called ambient mode that doesn’t use a sidecar proxy, but Cilium doesn’t use a sidecar proxy. It uses eBPF to do that routing for you, for this service mesh routing.

Robert Blumen 00:17:29 Okay. So let’s get into a bit more about the IPs and routing in the flat model, pod uses same IP ranges as the host network, but how does routing from pod-to-pod occur? Because the network interfaces would not have the pod addresses. So how does routing occur in the flat model?

James Strong 00:17:55 So we have to look at it from a pod-to-pod perspective, pod on the same host and then pod on a different host. So again, I’ll go with the example that I’m aware of. So Cilium when it runs, it knows what pod IP address ranges are running where and it installs those routes on the hosts for you. So if you are talking to like 10.1 from one host and your pod’s destination is 10.2 on a different node, it’s going to do that routing for you. So it will have the routes installed on there and it’ll route that traffic for you. If the pods are on the same host, it won’t leave the host. So it has that connection, so it knows that it’s on that bridge on that host. So it’ll just direct it to the pod that’s on that host. So it just depends on if it’s on the same host or on a different host.
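The same-host versus different-host decision described here is ordinary longest-prefix route matching. The following sketch models a node’s route table with Python’s standard `ipaddress` module; the pod CIDRs, node addresses, and next-hop labels are made-up values for illustration, and the real lookup happens in the kernel, not in user code.

```python
import ipaddress

# Hypothetical route table on node A. The pod CIDRs and node IPs are
# illustrative values, not taken from any real cluster.
ROUTES = {
    ipaddress.ip_network("10.0.1.0/24"): "local bridge (pods on this node)",
    ipaddress.ip_network("10.0.2.0/24"): "via 192.168.0.12 (node B)",
    ipaddress.ip_network("0.0.0.0/0"): "via 192.168.0.1 (default gateway)",
}


def lookup(destination):
    """Return the next hop for a destination using longest-prefix match."""
    address = ipaddress.ip_address(destination)
    matching = [network for network in ROUTES if address in network]
    # The most specific route (longest prefix) wins.
    best = max(matching, key=lambda network: network.prefixlen)
    return ROUTES[best]


same_host = lookup("10.0.1.7")    # stays on the local bridge
other_host = lookup("10.0.2.9")   # forwarded to node B
outside = lookup("8.8.8.8")       # falls through to the default route
```

A CNI that installs one route per remote node’s pod range is effectively maintaining a table like `ROUTES` on every node.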

Robert Blumen 00:18:51 On the same host the pod wants to communicate with another pod, it goes through an internal networking data structure called a bridge. It doesn’t need to go out on the host network.

James Strong 00:19:03 Yeah, this is where we start getting into the conversation of, it’s easy when we can draw pictures on this, but yeah, so that’s part of everything that gets set up from the CNI’s perspective: the network namespace gets created, the ethernet devices in that network namespace get created, and they get wired up to that bridge network on the host. So the communication between the pod network namespace and the host network namespace goes through that bridge interface that you were talking about.

Robert Blumen 00:19:35 Okay. Now let’s do the case where a pod wants to talk to a pod, which is on another host. The pod is in a network namespace, it has a route, it says go to this network interface in the namespace and now somehow that has to cross the host network and get delivered to a pod, on another node. Can you walk through some of those steps?

James Strong 00:20:06 Yeah, so pod A is sitting on host A and it wants to talk to pod B on host B. So it’s going to set the source IP to itself, pod A, and the destination is going to be pod B. So when it goes out that bridge, it’s going to understand that it’s on a different network and it’s going to have to go either out its default gateway and communicate to the host to understand where it needs to go. Or like I said in the Cilium case, those routes are installed on the host, so it knows where to send those out and how to route those. So again, this is where we get into the CNI either managing things for you or the underlying hosts having to be able to communicate with each other. So from the Cilium perspective, it knows that to reach pod B, it needs to send to node B, because the routes are there from that perspective.

Robert Blumen 00:21:05 The node B, it has an IP address that’s not necessarily in the host network or it doesn’t belong to the host’s main network interface. The thing that tells it which host to go to is there’s a route to pod B’s IP that says go out on this network? Something like that?

Robert Blumen 00:21:28 Okay. Now if I start up a node completely unrelated to Kubernetes, I start up a Linux VM, it has a default namespace, it’ll be given some routes to route things on its local network or its VPC. With Kubernetes, you now need to add a bunch of additional routes to manage the pod network. Linux offers some different kernel or programmable extension points to allow you to do this. You mentioned eBPF. What are some of the tools that enable Kubernetes to manage these routes?

James Strong 00:22:08 On a default installation, it’s going to use iptables. So it’ll set up the routing and manage the routing using iptables. That’s one of the other pieces of technology besides eBPF.

Robert Blumen 00:22:24 Okay. What are some differences between iptables and eBPF?

James Strong 00:22:30 So eBPF, maybe I’ll start explaining that from that perspective. So what eBPF does, I think the overarching phrase that everyone uses is that it’s JavaScript for the Linux kernel. So what JavaScript did for HTML, it does for the Linux kernel. So normally when you have to make changes or you want to run something inside of the kernel, you have to create a kernel module and you have to recompile the kernel and it’s a lot of effort from that perspective. So what eBPF did is it allowed us to create hook points into the kernel. So I can write some C code, I can run it through what’s called the BPF verifier and then I can hook into specific points in the kernel and make those changes without having to go through that whole loop of running a kernel module.

James Strong 00:23:22 And so that’s what Cilium will do for you, or any CNI that is running with eBPF under the hood. So it generates and creates all of those hook points into the kernel and allows you to make those changes in the routing table, intercepting packets, things like that. That’s how that works from the eBPF perspective. And then from an iptables perspective, there are lots of rules and that gets very complicated from a chaining perspective, and working through how that all works is probably not great to communicate without a diagram. We can put up a really nice chart of how all that works from a prerouting and a NAT perspective. If you ever look at an iptables rule set on a Kubernetes node, it’s a lot.

Robert Blumen 00:24:11 The answer to this might be it depends, but in a Kubernetes network, if you did look at the iptables rule set, how many iptables rules would you see?

James Strong 00:24:24 It is linear with the number of services and endpoints that you’re running. So with services and endpoints, we have to be able to map those pods to services and be able to route all of that and we have to have chains for all of those.
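The linear growth described here can be sketched with a rough counting model. The per-service and per-endpoint constants below are simplifying assumptions chosen for illustration (one dispatch rule per service, one jump rule per endpoint, two rules in each per-endpoint chain); real rule counts depend on the kube-proxy version, service types, and configuration.

```python
# Rough, assumed cost model for an iptables-based service implementation.
# These constants are illustrative assumptions, not measured values.
RULES_PER_SERVICE = 1    # dispatch rule matching the service IP and port
RULES_PER_ENDPOINT = 3   # one jump rule plus two rules in the endpoint chain


def estimate_rules(endpoints_per_service):
    """Estimate rule count from a list of endpoint counts, one per service."""
    services = len(endpoints_per_service)
    endpoints = sum(endpoints_per_service)
    return services * RULES_PER_SERVICE + endpoints * RULES_PER_ENDPOINT


small_cluster = estimate_rules([3] * 10)      # 10 services, 3 pods each
large_cluster = estimate_rules([3] * 10_000)  # 10,000 services, 3 pods each
```

Under this model the rule count is a fixed multiple of the number of services and endpoints, so a cluster with a thousand times as many services carries a thousand times as many rules on every node.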

Robert Blumen 00:24:40 One of the value propositions of Kubernetes is your host network can scale up or down and another is the scheduler can spin up or relocate services on the network to achieve optimal utilization or to recover from failures. Let’s say that a new node joins the host network, what kind of route adjustments need to be made on all the other nodes to incorporate that?

James Strong 00:25:09 So that’s one of the scaling issues of iptables: when you make a change, when you add a new node, all of those rules have to be propagated to that new node, or when you make a change to a service, you have to update all of those endpoints across the entire cluster from that perspective. And it can get, from a performance perspective, not great. I don’t have the actual numbers in front of me. There are lots of papers out there that talk about the performance impact of large clusters and large change sets on iptables versus eBPF. So performance is certainly one of the aspects of why folks are migrating to it.

Robert Blumen 00:25:48 With iptables, then, are we talking seconds, minutes? How long to get all the iptables rules?

James Strong 00:25:56 You put me on the spot on that one. I mean it can take time. Noticeable amount of time.

Robert Blumen 00:26:02 There are two parts here and I am not completely sure which one we’re talking about. You bring the new node into the host cluster, that new node needs to have its own iptables or eBPF rule set so it can route to the other nodes and then the entire rest of the network needs to have at least a few rules added to route to the new node. Which of those is more impactful or takes longer?

James Strong 00:26:28 I wouldn’t have a direct answer for you, given the dynamic nature of it. So a new node’s going to come up, Kubernetes is going to do that balancing act for you, so it’s going to shift workloads to that new node and then everything has to be refactored from that perspective.

Robert Blumen 00:26:44 And there’s a point where I want to make sure I’ve understood this correctly. The pod does run in its own network namespace. Are these routing rules added on a per-namespace basis? So each pod might have a slightly different set of routes?

James Strong 00:27:04 Yeah.

Robert Blumen 00:27:05 Okay. And then maybe we get into now what parts of Kubernetes are interacting with the namespace and the kernel extension routing scripts?

James Strong 00:27:19 That’s going to be the kube-proxy and/or the container network interface. So as things are changing, we’re watching the endpoints, we’re watching the pods come up and down and those are the two pieces that are mostly responsible for that.

Robert Blumen 00:27:35 So we haven’t talked about the kube-proxy. What is that?

James Strong 00:27:40 The kube-proxy, again, like we talked about, the Kubernetes project has lots of different pieces of software and the kube-proxy is one of them. It runs on each node and it helps define all of those pieces. So if you say I have this service with these endpoints, it’s going to set up and run those iptables rules for you. So that’s the piece that’s watching all of that. The CNI can do that too, and there are some that replace the kube-proxy altogether.

Robert Blumen 00:28:15 It has the word proxy in its name. Is it anything like other things called proxy, like reverse proxies that we run in front of a cluster?

James Strong 00:28:27 It is the network proxy. So it’s the one that’s again setting up all of those iptables rules and making sure that everything’s available from that perspective. So yeah, it’s proxying all of the pods-to-services communication or all of the communication in the cluster.

Robert Blumen 00:28:42 And that’s a per node or a per pod cardinality?

James Strong 00:28:47 Yes. Because the kube-proxy is running on each node, just like a CNI is running on each node. So there’s specific software that runs on all of the nodes from that perspective and kube-proxy is one of them, but it is managing the services and the endpoints for all of the pods and deployments and services. So that’s why I answered yes to your question.

Robert Blumen 00:29:11 There’s another part of the control plane we’ve not yet talked about that plays a role in the networking: the kubelet? Explain what that is.

James Strong 00:29:20 The kubelet has two main roles. So the kubelet on a host is the one that is watching when a pod gets scheduled to it, it’s going to start up that pod and kickstart all of the other things that we’ve talked about. But the kubelet also does health checks. So if you have health checks defined in your pod spec, it’s the one that’s going to be running those checks for you so that you make sure that the pods are healthy when you’re up and running.

Robert Blumen 00:29:45 At what point when the pod’s being started up is it given an IP address, and by who or what?

James Strong 00:29:53 That’s the interplay. So the kubelet is the one that has the communication with the CNI. So the Container Network Interface. So a pod gets scheduled, kubelet kick starts the process and asks the CNI for an IP address and to create that interface in the namespace.

Robert Blumen 00:30:10 And the CNI, based on how the network is architected, it has chunks or ranges of IP addresses that it knows are safe to hand out to a particular node? Something like that?

James Strong 00:30:26 Yeah, so when you start up a Kubernetes cluster, it asks for what that range is and then it carves up that range for all of the nodes in the cluster. So how big the pod range is will determine how many nodes can run in the cluster. You can look at something like the AWS VPC CNI, where it actually has math set up: the node size determines how many pods you can run on that node. So an m5.large can run 30 pods, I think it’s 29 pods. So if you want to run more than 29 pods, because of that restriction on the networking on the host, then you have to run a larger instance. So there’s things like that. So when you’re architecting a Kubernetes cluster, the network size does matter for the size of the cluster.
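Both kinds of sizing math mentioned here can be sketched in a few lines. The first function carves a cluster-wide pod range into equal per-node ranges; the /16 range and /24 per-node prefix are example values, not defaults from any particular CNI. The second uses the commonly cited formula for the AWS VPC CNI without prefix delegation, where each network interface reserves one address for itself and two pods use host networking. The figures for an m5.large (3 interfaces, 10 IPv4 addresses each) are stated from memory of AWS documentation and should be checked against the current instance tables.

```python
import ipaddress


def carve_node_ranges(cluster_cidr, node_prefix=24):
    """Split the cluster pod range into equal per-node pod ranges."""
    cluster = ipaddress.ip_network(cluster_cidr)
    return list(cluster.subnets(new_prefix=node_prefix))


def max_pods_per_node(interfaces, addresses_per_interface):
    """Commonly cited per-node pod limit for the AWS VPC CNI without
    prefix delegation: one address per interface is the interface's own,
    plus two pods that use host networking."""
    return interfaces * (addresses_per_interface - 1) + 2


node_ranges = carve_node_ranges("10.244.0.0/16")  # example range
max_nodes = len(node_ranges)                      # one /24 per node
m5_large_pods = max_pods_per_node(3, 10)          # assumed m5.large limits
```

With these example values, a /16 pod range split into /24s supports at most 256 nodes, and the formula reproduces the 29-pod figure quoted for an m5.large.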

Robert Blumen 00:31:16 Are there decisions you could make about IP ranges or network size that might turn out later to be not great decisions but are quite difficult to revert if you don’t like them?

James Strong 00:31:29 Yeah, so there are some things like if you have pods that you want to communicate across different clusters, they have to have a distinct pod IP address range. So there are issues from that perspective. Again, the sizing is an issue, but there are other tricks. I keep talking about AWS because that’s the one I have the most familiarity with for the past few years. So on EKS you can add secondary ranges now. You can also increase the VPC: you can add another range to the VPC and then you can add another range to the pod CIDR ranges. So they allow you to do things like that. So if you maybe properly scaled the cluster from an instance size perspective but not the IP address ranges, there are some tricks you can do after the fact, but it does require some planning, especially when you’re doing multi-tenant or you’re doing multi-cluster and having communication across clusters. So from a networking perspective, you still have to do what I would consider traditional network architecting, making sure you have the pod IP address ranges properly set up, especially if you’re doing the flat networking that we’ve talked about in VPCs and VPC design in general. Like if you’re going to peer VPCs, they can’t have overlapping ranges. So there’s lots of discussions. So I think the traditional networking engineering skills are still there, even from a Cloud native and a Kubernetes cluster perspective.
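The requirement that peered networks and cross-cluster pod ranges not overlap is easy to check ahead of time. The sketch below uses Python’s standard `ipaddress` module; the cluster names and CIDR values are hypothetical planning inputs.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of names whose CIDR ranges overlap.

    `ranges` maps a name (a cluster or VPC) to a CIDR string.
    """
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (first, second)
        for first, second in combinations(networks, 2)
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical pod ranges for three clusters that need to talk to each other.
planned = {
    "cluster-a": "10.0.0.0/16",
    "cluster-b": "10.0.128.0/17",  # sits inside cluster-a's range
    "cluster-c": "10.1.0.0/16",
}
conflicts = find_overlaps(planned)
```

Running a check like this during planning is cheaper than discovering the conflict after clusters are built, since pod ranges are hard to change later.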

Robert Blumen 00:32:56 You may be involved in different planning circumstances then based on for cost reasons, you want some pretty large machines, but you have a very small type of pods, you’re going to be running lots of pods per machine versus you have some really memory intensive service where you’re going to dedicate an entire machine to that. And those type of considerations would figure into how large your IP ranges need to be. Am I going more or less in the right direction with that?

James Strong 00:33:28 Yeah, I mean I have customers who are running machine workloads on Kubernetes where it is a single pod running on a single node, and there are also networking tools, like Ingress-NGINX or any Ingress controller, that you can have running on an individual node. So again, lots of planning that’s involved based on use cases and what workloads you’re going to be running on the cluster. Again, conversations with the platform team or with whatever team’s managing your Cloud and Kubernetes clusters.

Robert Blumen 00:34:03 Okay, there’s another Kubernetes resource that is very tightly tied to the network. You’ve mentioned a few times the idea of a Service. You might not know that that’s a Kubernetes resource when you hear it, because the word Service is so ubiquitous in this type of engineering, but there is a particular thing called a Service. What is that?

James Strong 00:34:27 So this is again one of those things where it’s a Kubernetes abstraction, but it’s still very similar to a networking concept. So you can think of a Service as a virtual IP that is stable, so it stays with the Service. So when you spin up a Service, you’ll get an IP address that’s routable in the entire cluster, which is essentially just iptables rules that get put across the cluster. But the issue that it solves is that pods are ephemeral and pods can scale as needed, right? Kubernetes is a dynamic system. There are controllers that are managing how many pods need to be running in the system. So if a pod goes down for whatever reason, it’ll bring up a pod with a different IP address because it’s on a different node, with a different pod IP address range. And so that changes, and that makes things difficult from a DNS and a client perspective. So to help with that, Kubernetes has its abstraction of a Service that has a single IP address that routes to all of the pods and that does load balancing in the cluster to those pods. So you can hit that Service endpoint and not worry about the pod IP addresses changing from that perspective.

Robert Blumen 00:35:42 For example, then, say in my cluster I have a product catalog Service. It’s more intensively used at some times of day than others because more people are shopping during daytime hours. Kubernetes will scale those up or down, yet all the clients in Kubernetes that want to talk to that Service will see that single IP, and then that will get routed to one of the pods that is supporting that Service. Did I get that right?

James Strong 00:36:14 Yeah, that is one use case. I do want to mention though that pod auto scaling is also just another piece of software that monitors that from that perspective. So everything is a controller, everything’s a piece of software that has its own individual use cases.

Robert Blumen 00:36:28 And go a bit more into how this single IP address acts as a load-balancing abstraction to multiple pods.

James Strong 00:36:39 Yeah, so there’s a configuration option when you start up a Kubernetes cluster, that’s the Service CIDR range. And so that’s the IP address range that’s going to be used for the Services, and on the backend that’s iptables rules that are set up across all of the nodes in the cluster. So when you create a Service, it gets an IP address, and kube-proxy or the CNI, whichever one’s responsible for all of that, sets up either the iptables rules or the eBPF routing rules, whichever is the backend technology that the CNI is using. So all of that gets set up for you automatically by that software. So when a new pod comes up, from a Service’s perspective, we use things called labels and selectors. So you label a pod, say this is app foo, and I have a Service that is watching for pods that come up with app foo. So when those labels and selectors match, it creates an endpoint, and those endpoints get updated across the cluster saying, if you want to reach one of these, here’s the IP address of one of those that matches it.
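
The label-and-selector matching described above can be illustrated with a toy model. This is a sketch of the idea only, not how the Kubernetes endpoint controllers are implemented; the `Pod` class, pod names, and IP addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict = field(default_factory=dict)
    ready: bool = True


def matches(selector, labels):
    """True when every key/value in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints_for(selector, pods):
    """IPs of ready pods whose labels match the selector (sorted for stable output)."""
    return sorted(p.ip for p in pods if p.ready and matches(selector, p.labels))


pods = [
    Pod("web-1", "10.244.1.5", {"app": "foo"}),
    Pod("web-2", "10.244.2.7", {"app": "foo"}),
    Pod("db-1", "10.244.1.9", {"app": "bar"}),
    Pod("web-3", "10.244.3.4", {"app": "foo"}, ready=False),  # not ready yet
]

print(endpoints_for({"app": "foo"}, pods))
```

When a pod is replaced and comes back with a new IP, re-running the match yields a new endpoint list while the Service's own IP stays the same.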

Robert Blumen 00:37:48 Is this an instance of a thing we talked about a bit earlier that when a new pod joins a Service that potentially every single node now is impacted because it may need to add a new route?

James Strong 00:38:03 That’s part of it, yeah.

Robert Blumen 00:38:04 The routing to the Service IP, will it load balance at the node level, or does it take into account that there might be four pods of the same type running on a very large node and that really should count as four rather than one?

James Strong 00:38:22 You’re asking a very good question. I think it does take into account the number of pods that are in there, and I think it does equal path load balancing. I don’t think you can really change that from an iptables perspective, but I think you can with certain CNIs. That use case I’m not a hundred percent on, but I know that it does take into account all of the pods that are in the cluster. And then that also brings into account some of the other things like external traffic policy. So one of the big issues when you’re using Kubernetes is understanding the client IP when it comes in. So when we do the routing, there’s source NAT, so the source IP address changes because of the routing that happens on the cluster. So we lose the client’s IP from that perspective.

James Strong 00:39:12 So you can do something with a Service by setting up an external traffic policy and say it’s either cluster-wide or it’s local. A problem with local is that if a pod isn’t running on the node, then it will drop that traffic. So you’ll have some issues with that, but you’ll still get the client IP because it won’t do the extra hop. So say you have 50 nodes on the cluster and you have 10 pods running: if you hit one of those nodes where it’s not running and your external traffic policy is set to local, it drops the traffic. So you have to make some decisions from that perspective. There are other ways to get the client IP. And that gets into one of the use cases we didn’t get a chance to talk about, with Ingress controllers or load balancers.

Robert Blumen 00:40:04 I’m not sure I understand this last example. So what is the need for the receiving node to have the client IP?

James Strong 00:40:12 For whatever reason that the application wants to track what the client IP is.

Robert Blumen 00:40:18 And did I understand correctly that the route from the Service’s cluster-wide IP might route to a node which does not have an instance of that Service on it?

James Strong 00:40:30 Yeah, it could. So it’s going to send it to a node because the iptables rules are set up across all of the nodes. So I can hit a node and it gets routed to another node that has that pod running on it, and that’s when you set it up with a cluster external traffic policy. So if you have it set to local, it’s only going to try to access the pod that’s running locally on that node, and if there isn’t one you’ll lose that traffic.
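
The trade-off between the two external traffic policies can be modeled in a few lines. This is a simplified sketch of the behavior as described in the conversation, not of kube-proxy itself; `route_external` and the node names are hypothetical, while `Cluster` and `Local` are the two values the policy accepts.

```python
import random


def route_external(arrival_node, pods_by_node, policy, rng=random):
    """Return (target_node, client_ip_preserved), or None if traffic is dropped."""
    if policy == "Local":
        # Only pods on the node that received the traffic are eligible,
        # so there is no extra hop and the client IP survives.
        if pods_by_node.get(arrival_node, 0) > 0:
            return (arrival_node, True)
        return None
    if policy == "Cluster":
        # Any pod in the cluster is eligible; the extra hop involves
        # source NAT, so the original client IP is lost.
        candidates = [n for n, count in pods_by_node.items() for _ in range(count)]
        if not candidates:
            return None
        return (rng.choice(candidates), False)
    raise ValueError(f"unknown policy: {policy!r}")


pods_by_node = {"node-1": 2, "node-2": 0, "node-3": 1}
print(route_external("node-2", pods_by_node, "Local"))    # dropped
print(route_external("node-2", pods_by_node, "Cluster"))  # forwarded elsewhere
```

The sketch makes the decision explicit: `Local` keeps the client IP but only works on nodes that host a matching pod, while `Cluster` always finds a pod but hides the client IP.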

Robert Blumen 00:40:59 Okay. I wanted to talk a bit about DNS, does the Service resource interact with the DNS in the Kubernetes cluster?

James Strong 00:41:11 Yes. So each Service gets a DNS name in the cluster from that perspective. So you can do a lookup. I think it’s service name, dot namespace, dot cluster, dot local or default; I can’t remember the exact DNS name, but that does map to the Service IP address.
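
For reference, the conventional in-cluster name for a Service has the form `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain is `cluster.local` unless the cluster was configured otherwise. The sketch below builds that name and resolves it with the standard library; the Service name is borrowed from the product catalog example earlier in the conversation, the `shop` namespace is hypothetical, and the lookup only succeeds from inside a cluster.

```python
import socket


def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Build the conventional in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_service(service, namespace="default"):
    """Resolve a Service name to its IP addresses (run from inside a pod)."""
    infos = socket.getaddrinfo(
        service_fqdn(service, namespace), None, proto=socket.IPPROTO_TCP
    )
    return sorted({info[4][0] for info in infos})


print(service_fqdn("product-catalog", "shop"))
```

From a troubleshooting pod, calling `resolve_service("product-catalog", "shop")` would be expected to return the Service's cluster IP if cluster DNS is healthy.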

Robert Blumen 00:41:31 So we’ve covered a lot of how it works. I wanted to talk about something you mentioned right at the top, which is that it’s often the network’s fault to where you might blame the network first. Could you give a troubleshooting experience you had on a Kubernetes cluster that did turn out to be something to do with the network? How did you diagnose it and how did you resolve it?

James Strong 00:42:00 One of my favorite tools is a container called netshoot. So netshoot has a lot of the networking troubleshooting tools that you would expect to have. So tcpdump, curl, all those ones that allow you to do all of the things that you need to help troubleshoot that. So you can start up that pod and then you can start diagnosing what the issues are. If it’s a DNS issue, can I do an internal DNS lookup? Can I do an external DNS lookup? Things like that. There’s also a really cool tool to help diagnose and just look at network connectivity, and it’s called Goldpinger. I think it’s by the Bloomberg group. Yeah, it’s a great name, but what it does is it sets up a DaemonSet, and a DaemonSet sets up a pod on every node in the cluster, and then they all start talking to each other and they start pinging each other, and then you can have like a graph UI and you can see where there are connectivity issues in the cluster.

James Strong 00:43:05 So between those two pieces of software, I like to use those to help troubleshoot issues. I haven’t really had a chance to use the ephemeral pods that just came out. I think they just went GA, maybe 1.28, 1.29, I can’t remember. But you can attach like a network pod to distroless pods that don’t have a shell and things like that. Haven’t had a chance to use those, but that would be really helpful from that perspective. And the last one that I looked at is that the kubelet kept killing a pod off. When we got onto the host and were looking around, what had happened is, remember when we talked about how the kubelet will use the routing rules that are on the node to try to reach the pod, to check the liveness probes or readiness probes? Well, somebody had created a host interface with the same IP address as a pod, and it’s going to favor the locally connected one over the one that’s connected via a virtual Ethernet (veth) pair. So it was trying to route to that local host interface, and that wasn’t running port 80, it wasn’t running anything. So again, I liken it to still being normal networking troubleshooting from that perspective. What can I reach? How far can I go out? What’s the latency look like? We’re still trying to go through the normal networking troubleshooting, with the little extra of the Kubernetes abstractions that we’ve talked about today.
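
The "what can I reach, and what does the latency look like" questions can be approximated with a small TCP connect probe, similar in spirit to what a health check on port 80 would attempt. This is a minimal sketch using only the standard library; `tcp_probe` and `check_targets` are hypothetical helper names, and a successful TCP connect says nothing about whether the application behind the port is healthy.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return TCP connect latency in milliseconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return None


def check_targets(targets, timeout=2.0):
    """Probe each (host, port) pair and return a dict of latencies."""
    return {(host, port): tcp_probe(host, port, timeout) for host, port in targets}
```

Running `check_targets` against a pod IP from the node and again from another pod would help show where along the path connectivity stops, which is the kind of comparison that exposed the duplicate-IP host interface in the story above.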

Robert Blumen 00:44:34 That is a great example. That reminds me of another question I wanted to ask. If I bring up any kind of workload on one of the major Cloud providers, they’re going to provide a lot of metrics and dashboards of their VMs and their network. I could see traffic ingress and egress volumes on a per VM basis. Now if you’re running Kubernetes, do you need to build or add in something that will monitor the Kubernetes networking abstractions and why would you do that and what would you be looking for?

James Strong 00:45:10 Yeah, I mean pods are still using resources as needed, so you’ll still want to monitor those resources. There are tools that allow you to do that. Like Prometheus: you can do pod metrics monitoring from that perspective, and you can ship those out to things like CloudWatch. I think EKS has something, I think it’s called Container Insights. They have metrics that you can look at to monitor your pods from that perspective. There are other tools as well, traditional monitoring tools. I know Datadog has a container plugin that allows you to monitor those as well. So there are what people would consider, again, traditional monitoring tools that have adapted to be able to run in a container world.

Robert Blumen 00:45:58 Can you give an example of something that might get lost if you only could see network aggregated up to the VM level, but you might be able to see it when you have the resolution down to the pod level?

James Strong 00:46:14 One of the things that gets lost is that container context. So containers are just processes running on a host. So you’ll see a host spike in network traffic, but you won’t know why. Maybe you’ll just see the process ID of those things, but it takes a lot of work to correlate that process ID to a container that’s running and to an application team that maybe just did a new deployment that had a memory leak. So having tools that are container-aware is very helpful to allow you to monitor pods and processes from that perspective. So yeah, you’ll lose a lot of context: you’ll just see 50 processes, and then you’ll have to do the mapping manually, and nobody wants to do that.
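
To give a sense of the manual mapping described here, the sketch below pulls a container ID and pod UID out of the contents of `/proc/<pid>/cgroup` on a Linux node. It is an illustration under assumptions: the sample line mimics a cgroup v2 layout with the systemd driver and containerd, the IDs in it are made up, and real paths vary by runtime and cgroup driver, so the regular expressions may need adjusting.

```python
import re
from pathlib import Path

# Container IDs are commonly 64 lowercase hex characters.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")
# Pod UID after "pod", with either "-" or "_" as the separator.
POD_UID = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)


def parse_cgroup(text):
    """Extract the container ID and pod UID from /proc/<pid>/cgroup contents."""
    ids = CONTAINER_ID.findall(text)
    pod = POD_UID.search(text)
    return {
        "container_id": ids[-1] if ids else None,
        "pod_uid": pod.group(1).replace("_", "-") if pod else None,
    }


def container_for_pid(pid):
    """Read the cgroup file for a PID on the local host and parse it."""
    return parse_cgroup(Path(f"/proc/{pid}/cgroup").read_text())


# Illustrative line only; the IDs are invented for this example.
SAMPLE = (
    "0::/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod1234abcd_12ab_34cd_56ef_1234567890ab.slice/"
    "cri-containerd-" + "a1" * 32 + ".scope\n"
)

print(parse_cgroup(SAMPLE))
```

Even with the IDs in hand, someone still has to look up which pod, namespace, and team they belong to, which is the work that container-aware monitoring tools do automatically.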

Robert Blumen 00:47:02 We’re getting close to end of time here. I mentioned your book at the top. I understand a second edition is being discussed or prepared. What would be in the second edition that readers could look forward to?

James Strong 00:47:16 Yeah, so Kubernetes networking, it is for folks that are at an entry level. I would say it’s a layered approach because we start with talking about just normal networking, and then we build up to the Linux networking that we just talked about, to running on a Cloud provider with a managed Kubernetes service. And I think it was released about two years ago, so things like Gateway API hadn’t gone GA yet; that’s probably the big one in there. I have a whole list that I’ve sent to my provider. We’ll probably talk a lot more about eBPF. There was a short stint in there on that. We did a huge focus, I think, in chapter two, just walking through that iptables example that we talked about, like how those get set up, how they get mapped to Services and endpoints. I think the bulk of chapter two was just on iptables, so we’ll probably downplay that and migrate more into talking about eBPF. Those are the two off the top of my head that I can remember. And then always just updating the examples and things like that. Probably adding a troubleshooting section, because we didn’t really have a lot of troubleshooting.

Robert Blumen 00:48:28 And we will link to some of those tools you mentioned in the show notes. Is there any place you’d like listeners to go where they can find you?

James Strong 00:48:37 I am strongjz on most platforms. So the Kubernetes and the CNCF Slack, GitHub. If you have an issue with the Ingress-NGINX controller, which I help maintain, please come to our community meetings every Thursday at 11 Eastern. We have more information about that in the Kubernetes Slack, in the Ingress-NGINX users channel. I hang out there to help answer support questions, anything like that.

Robert Blumen 00:49:04 James, thank you very much for speaking to Software Engineering Radio.

James Strong 00:49:08 Thank you. It’s been lots of fun.

Robert Blumen 00:49:09 This has been Robert Blumen and thank you for listening.

[End of Audio]
