Luke Kysow, software engineer at HashiCorp, discusses service mesh and HashiCorp’s open source service mesh, Consul. Luke and host Priyanka Raghavan conducted a deep dive into the features of a service mesh, including service discovery, health monitoring, infrastructure support, and security. The last segment focuses on how Consul talks to Envoy and also compares Consul to other service meshes in the industry.
- SE-Radio 361 on Istio
- SE-Radio 264 on Service Discovery
- SE-Radio 385 on Zero Trust Networks
Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].
Priyanka Raghavan 00:00:16 Hi, this is Priyanka Raghavan for Software Engineering Radio. In this episode, we will deep dive into Consul, the service mesh from HashiCorp, and my guest today is Luke Kysow, Consul engineer at HashiCorp. Luke is also the maintainer and co-creator of Atlantis, an open-source tool for Terraform collaboration. Luke has given many talks at different software engineering conferences. So I’m really excited to have you on the show. Welcome to the show, Luke.
Luke Kysow Thank you so much for having me.
Priyanka Raghavan Is there anything else you would like listeners to know about you before we jump into Consul?
Luke Kysow No, I think you mostly covered it. I’m an engineer at HashiCorp. I work on Consul, and prior to that I did a lot of DevOps and just regular development at a couple of other companies.
Priyanka Raghavan Perfect. As you know, we’ve done a couple of episodes on microservices as well as service discovery. We did an episode two years back on Istio, the service mesh from Google, and four years back we talked about service discovery. Obviously a lot has changed over, sorry, it’s actually five years later, 2021. So we’ll try to cover what’s changed since then. So the first thing I need to ask before we actually jump into Consul is: could you recap for our listeners what a service mesh is, in your own words?
Luke Kysow So when we think about a service mesh, I think the most common way that this looks is that every single service you are running in your architecture has a proxy that runs in front of it, and that proxy captures the traffic that leaves your service and also captures the traffic that’s entering your service. Now that you have this proxy in front of every application, you’re able to control the traffic that goes into and out of that application.
Luke Kysow 00:02:18 And that’s kind of what a service mesh is: now there’s this mesh of proxies, and you can control where that traffic goes and whether or not it’s allowed into the application, and you can control all of that through a control plane. So when we think about service mesh, that’s kind of the basic manifestation of it.
Priyanka Raghavan So in the 2016 episode, which was episode 264, the guest, James Phillips from Consul, talked about the architecture being a Consul agent running on different nodes, with service discovery happening through the gossip protocol. That’s the gist of what he said. What’s new and what’s changed?
Luke Kysow 00:03:04 Yeah. So we have this division now between the idea of service discovery, which is kind of the older concept, and this new world of service mesh. And it’s a little bit complicated: what exactly is the difference between these? Because in the end, you know, it’s just one service calling another service. We’ll just talk a little bit about service discovery. With service discovery, what we have is a list of all our applications and all of their IP addresses and their ports. We’re able to know where an application is, and to look that up from within our applications. So, for example, say we have a web service and it’s trying to call the database. The web service can look up within the service discovery mechanism: okay, where is this database running? And then it can direct the traffic over there. This lookup can be done through DNS, which was traditionally how it would work.
Luke Kysow 00:04:00 So you would use a name like example-company.database, and then the service discovery mechanism would tell you, oh, the IP address for that is 10.10.10.10, and then I can just talk to that on whatever port. So what’s evolved since then is that now the application, instead of participating in that service discovery, where it needs a way to find out for itself what the IP address is for the database, talks through a proxy. And so now that proxy is the thing that’s doing the lookup for where the database is. You look at that and you say, okay, well, what’s the point of the proxy then? My service could have just done a DNS lookup. But now that we have this proxy there, the proxy can do so much more than just do a DNS lookup, right?
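To make the lookup step concrete, here is a minimal sketch in Python of what a service discovery catalog answers. The `CATALOG` contents, the `Instance` type, the `lookup` helper and the port number are all illustrative assumptions, not Consul’s actual data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instance:
    address: str
    port: int


# Hypothetical catalog: service name -> known instances (illustrative values).
CATALOG: Dict[str, List[Instance]] = {
    "database": [Instance("10.10.10.10", 5432)],
}


def lookup(service: str) -> List[Instance]:
    """Return every known instance of a service, or an empty list."""
    return list(CATALOG.get(service, []))
```

In the older model the web service would call something like `lookup("database")` itself; in the mesh model the proxy in front of it performs the equivalent lookup on its behalf.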
Luke Kysow 00:04:51 It can control where that traffic goes. It can retry that traffic. It can notice that the application is down and use something like a circuit breaker pattern, where it doesn’t send it traffic, or maybe it can fail over to another data center where the database is actually up and running. That’s really the evolution there: now we’re handing off this “where does my request go” step to the proxy, and that gives us so much more power. And the reason we didn’t do that before was that it was extra complexity that we didn’t need. But now that we have this world of microservices and dynamic services, there is a lot more complexity in how our network works, so we do find ourselves needing to program it and add some smarts in there. In addition, with the rise of platforms such as Kubernetes, it’s a lot easier to deploy that proxy.
Luke Kysow 00:05:47 Now, alongside all of our applications. You could imagine, back when James Phillips was on the show in 2016 and we didn’t have Kubernetes, if I were to tell you, okay, all you need to do now is deploy a proxy on every single VM you’re running, next to every single application, and configure it with a certificate and rotate that certificate. That was a very large job. But now we have platforms like Kubernetes, and again you’ll see this pattern of programmability, where we can program what the application actually runs and what runs alongside it. So concretely, I’m talking about how we now have this concept of pods, and we can actually add containers, Docker images, to run alongside the application in those pods. Now it’s so much easier to add that proxy to all of our applications, and that makes this pattern a lot more viable.
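The circuit breaker pattern mentioned in this answer can be sketched in a few lines of Python. This is a simplified illustration under assumed thresholds, not how Consul or Envoy implement it; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stop sending traffic to an upstream after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # seconds to wait before a trial request
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def allow(self):
        """Return True if a request may be sent to the upstream."""
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success):
        """Feed the outcome of a request back into the breaker."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
```

A proxy would call `allow()` before forwarding a request and `record()` afterwards; when `allow()` returns False it could fail fast or fail over to another data center, as described above.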
Priyanka Raghavan 00:06:47 There is a lot of stuff that you’ve told us in this introduction, so let’s break it down. If I were to ask you, what are the main components of a service mesh? I think one you alluded to was, of course, service discovery, right? And also, I guess, how healthy your resources are. And then what are the other components?
Luke Kysow 00:07:10 Yeah. So the key pieces of information that we need to know are, first: where are the services? That’s the service discovery mechanism. So where is the database, what is its IP address, how can I route there? The second one is, like you mentioned as well: is the database healthy? Can I route to it? And that’s really all that we need. When we start talking about multiple clusters, which we can talk about later, then we do need some extra information. The other piece of information that we need, and it’s very important to the service mesh, is: is this web service allowed to talk to this database? So that’s the authorization: is web allowed to talk to DB? Those are the pieces of information we need. And then the concrete components involved are the proxies themselves, and then what we call the control plane, which is the application that’s programming those proxies and telling each proxy what it should be doing.
Priyanka Raghavan 00:08:13 So for the last two months at work, I’ve been struggling with bringing up an open source API gateway. So I just have to ask you this before we jump into Consul: do I also need to have an API gateway when I have a service mesh?
Luke Kysow 00:08:27 Yeah, that’s a super, super great question. Yes. The answer is that the problems that an API gateway solves aren’t being solved at the same time by a service mesh, although I will say that the two technology areas are increasingly overlapping. When we think about a service mesh, traditionally it’s been service-to-service traffic. So it’s the web service that’s talking to the database; it’s the billing service that’s talking to the payment service, let’s say. So you’re talking about service-to-service traffic, whereas an API gateway is obviously about external consumers trying to talk to your API. And so the functionality is similar, but a little bit separate. With an API gateway, you might have a bunch of logic around developer tokens: who is this person coming in? Are they allowed to access this API? Rate limiting, et cetera, et cetera.
Luke Kysow 00:09:29 So you might have something like an admin panel that lets you see which developer tokens have been calling in. There is a little bit more functionality as a result of having to manage the fact that these are external clients, whereas for the service mesh, typically these are internal clients. And so we don’t necessarily have a bunch of tokens that we have to rate limit on; instead, you might do something like service-level rate limiting. The functionality is overlapping; however, they are separate concerns at this time.
Priyanka Raghavan 00:09:59 Okay. That makes sense. Let’s jump into Consul now. You did touch upon service discovery. How do services in Consul discover other services? The question I wanted to ask was, how do you bootstrap the entire system? You did talk about DNS, but maybe you can elaborate a bit more.
Luke Kysow 00:10:21 For sure. When we talk about how it all gets bootstrapped, the platform with the information about where the services are needs to be there first. So in the case of a platform like Kubernetes or Nomad or ECS, or one of these container schedulers, the platform already knows which services are running. The process of bootstrapping for Consul is more about starting Consul and then telling Consul, ‘Hey, here’s what’s running in the system.’ We already know what’s running, so it’s like a bridge system, basically, that’s going to be watching an API, for instance the Kubernetes API, watching for what’s running there and then telling Consul, ‘Hey, this is running there.’ And when we talk about a platform like virtual machines, where there is no overarching scheduler that knows everything in that world, the services actually need to tell Consul, ‘Hey, I’m running in that world.’ What we have is a way, either via API or via config file, where when the service is running it tells Consul, ‘Hey, I’m running here. This is my IP address. This is my port.’ So that’s how we see the service discovery information get into Consul.
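As an illustration of the “via API” path on virtual machines, the sketch below builds (but does not send) a registration request for a local Consul agent using Python’s standard library. The `build_registration` helper, the example address and the example port are assumptions for illustration; the endpoint path and the `Name`, `Address` and `Port` fields follow Consul’s agent service registration API as I understand it, so check the current API reference before relying on them.

```python
import json
import urllib.request


def build_registration(name, address, port, agent="http://127.0.0.1:8500"):
    """Build a PUT request that tells the local Consul agent about a service."""
    payload = {
        "Name": name,        # logical service name, e.g. "web"
        "Address": address,  # "this is my IP address"
        "Port": port,        # "this is my port"
    }
    return urllib.request.Request(
        url=f"{agent}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Building the request has no side effects; sending it would need a running agent:
# urllib.request.urlopen(build_registration("web", "10.10.0.5", 8080))
```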
Priyanka Raghavan 00:11:35 Okay. Can you also touch upon how you find out if a resource is down? The health monitoring, how does that happen?
Luke Kysow 00:11:43 Again, it’s kind of this split, depending on which platform you’re running on. So in platforms like, again, Kubernetes, Nomad, ECS, those platforms actually already have the information about the health of each service, and the reason they already have that information is because they need to make decisions about it: oh, this application is down, I might need to reschedule it or restart it, or at least give a warning to the user. So these systems already have this information. In this case, it’s a matter of syncing that information into Consul and telling Consul, hey, Kubernetes says my web service is down; you should know that too, so you don’t route traffic there either. Now, that is configured on the platform side, and usually this is active health checking. So the person who’s running the service on the platform will actually configure, for instance: hey, call me on my /health endpoint every minute.
Luke Kysow 00:12:37 And if I don’t respond with a 200, treat me as unhealthy. The platform will then mark the service as unhealthy, we’ll sync that information to Consul, and we’ll know not to route to that service, or at least perform whatever we want to do there. There is an additional type of health checking called passive health checking, where we actually notice on our calls between the different proxies that, hey, the platform says the web service is healthy, but every single call that I’m making to this instance is returning a 503. And so we’re actually going to say, okay, well, I think the platform’s wrong here. I actually think that application is unhealthy, and we’re internally going to mark it as unhealthy and we’re not going to route to it anymore. So there are almost two layers of health checking there.
Luke Kysow 00:13:23 So that’s the platform view of the world, where the platform already owns that information and we just sync it to Consul. And then when we look at a platform like, for instance, virtual machines, where there is no overarching scheduler, Consul will be the one that’s doing the health checking. You did allude to this earlier: our architecture for virtual machines is that we have a Consul client, which is just a process running on every virtual machine. And so in that case, it would be that Consul client’s job to make calls to the application, however it was configured; maybe it’s a /health endpoint. So we would actually be making those calls to the application, the client would notice that it’s down, and that’s how Consul would know that the application is no longer healthy.
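The passive health checking described in this answer, where a proxy stops routing to an instance that keeps returning errors even though the platform reports it healthy, could be sketched roughly as follows. The `PassiveHealth` class, its threshold and the idea of counting consecutive 5xx responses are illustrative assumptions rather than the mechanism any particular proxy uses.

```python
class PassiveHealth:
    """Eject instances that return too many consecutive 5xx responses."""

    def __init__(self, threshold=5):
        self.threshold = threshold    # consecutive errors before ejection
        self.consecutive_errors = {}  # instance -> current error streak
        self.ejected = set()          # instances we no longer route to

    def observe(self, instance, status):
        """Record the HTTP status of a real request sent to an instance."""
        if 500 <= status < 600:
            streak = self.consecutive_errors.get(instance, 0) + 1
            self.consecutive_errors[instance] = streak
            if streak >= self.threshold:
                self.ejected.add(instance)
        else:
            self.consecutive_errors[instance] = 0

    def routable(self, instances):
        """Filter out instances the proxy has marked unhealthy itself."""
        return [i for i in instances if i not in self.ejected]
```

A fuller version would also let ejected instances back in after a cool-down; that is left out here to keep the sketch short.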
Priyanka Raghavan 00:14:07 Okay. The next question I need to ask you has to be on load balancing. There was an interesting discussion in episode 264; I just listened to it a few days back. I think the question that the host asked was whether load balancing was a part of service discovery, and Jim answered that if service discovery was done right, you wouldn’t need load balancing. What are your thoughts on that? That’s interesting.
Luke Kysow 00:14:40 I think maybe I can guess at what he meant by that, which is: say I have two instances of a database and one of them is under load. Then when you ask, hey, what are the IP addresses of the database, perhaps my service discovery mechanism wouldn’t return the IP address of the database that’s overloaded, and instead it would only return the one IP address of the instance that’s fine. And so that health information and the load information is built into the service discovery mechanism. However, I think it’s harder to do this kind of thing in service discovery than in service mesh. It’s just like any other field in our technology area here, where sometimes you need that extra layer of programmability and an extra layer of control. When we end up putting all this stuff into the proxy and the proxy is the one that decides where to route these requests.
Luke Kysow 00:15:40 You now have basically infinite control over where these requests route and what kind of load balancing you want to do. So, for example, you might want to do load balancing based on: I’ve already sent a hundred requests over to this database instance, so I’m going to start sending these requests over here. That would be this proxy keeping track of where it’s been sending requests. But you might also want to do some more complex load balancing around: I noticed that this instance of the database responds way faster than that instance, so I’m just going to send all my traffic to the one that’s responding really, really quickly. This kind of locality-aware load balancing is very, very hard to build into a service discovery mechanism, because how would the service discovery system know that that place was closer, that the call would be faster for this specific instance, right? So you can see how, when you get into it, there’s a lot of complexity and extra goodies you can get out of a service mesh, because you’re at the proxy and you have full control, and you also have full observability into what’s happening with my request at this level.
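The two load-balancing ideas in this answer, spreading requests evenly versus favouring the instance that responds fastest, can be contrasted in a short Python sketch. The helper names and the moving-average smoothing factor `alpha` are assumptions made for illustration; real proxies offer more sophisticated policies.

```python
import itertools


def round_robin(instances):
    """Cycle through instances so each gets an equal share of requests."""
    return itertools.cycle(instances)


class LatencyAware:
    """Prefer the instance with the lowest smoothed response time."""

    def __init__(self, instances, alpha=0.2):
        self.alpha = alpha
        # Starting at 0.0 means untried instances are preferred at first.
        self.latency = {i: 0.0 for i in instances}

    def record(self, instance, seconds):
        """Fold an observed response time into a moving average."""
        old = self.latency[instance]
        self.latency[instance] = (1 - self.alpha) * old + self.alpha * seconds

    def pick(self):
        """Return the instance that currently looks fastest."""
        return min(self.latency, key=self.latency.get)
```

The second policy only works because the proxy sees every response time itself, which is the point being made about what a plain service discovery lookup cannot know.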
Priyanka Raghavan 00:16:44 Well, you talked about observability, and I was going to ask about that later, but I think maybe it makes sense to ask now. I read about service-level observability in the literature. What does this mean? Do you have a dashboard where I can see new services or how the communication requests are happening?
Luke Kysow 00:17:05 Yeah. There are a lot of acronyms around there: service-level observability, service-level objectives. When we talk about observability at the service level, really what we mean is that there are metrics that tell us what’s happening with this service. There’ll be application-specific metrics. So you might want to have a counter around how many times a specific endpoint has been called, like /signup or something like that. Or you might want to have metrics around how long your database calls are taking, so that’s something you might want to alert on, right? When we look at where this ties into the service mesh: before the service mesh existed, you just had your web application calling your database, and you could already instrument that. So in your code, you could start a timer, and then you could make a call out to the database.
Luke Kysow 00:17:57 And then when the response comes back, you look at the time and you work out how long that took. And then you call your metrics library and you emit some sort of metric, and maybe that fires off over to something like StatsD or Graphite. So we already had this ability to instrument our applications. But what we found was that you now have to write this code, which is kind of annoying, because I want to just focus on my business logic. And also, I forgot to write this code. Or my name for this time to talk to the database was request_time, but the other service called it request-time. And so now when you’re an operator and you want to look at, I wonder how long calls to the database are taking, you have a hundred different metric names, and they’re all using different technologies to emit these metrics, right?
Luke Kysow 00:18:49 Or they haven’t even instrumented it. Where the service mesh helps with this is, again, that everything’s going through this proxy. Now we can emit the same name for that metric across all our different proxies. We can program the proxies to emit the information as Prometheus data or Graphite data. So we have this full control now where we can emit the same metrics. If the team accidentally hadn’t instrumented their calls to the database, that’s not a problem, because all their calls go through our proxy, so we’re definitely going to emit those metrics. That’s where we see the service mesh really, really helping: you can have metrics and observability into your service, across all the different services across your infrastructure. We can talk about this later, but again, not only can you now get this information, you can actually do something about it, right? If you notice the calls to your database are slow, you can maybe make a call: okay, we’re going to spin up a second instance of it and we’re going to start routing 30% of the traffic to it, or something like that. And you could control all this from the operations center, basically.
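The naming problem described here goes away when a single component times every call. The following is a minimal sketch, assuming a hypothetical metric name `upstream_request_seconds` and an in-memory `samples` store standing in for a real metrics backend such as Prometheus or Graphite.

```python
import time
from collections import defaultdict

# Hypothetical metric name, shared by every proxy so operators see one series.
METRIC = "upstream_request_seconds"

# Stand-in for a metrics backend: (metric, source, destination) -> durations.
samples = defaultdict(list)


def proxied_call(source, destination, call, clock=time.perf_counter):
    """Time any upstream call and record it under one uniform metric name."""
    start = clock()
    try:
        return call()
    finally:
        # Recorded even if the call raises, so failures are visible too.
        samples[(METRIC, source, destination)].append(clock() - start)
```

For example, `proxied_call("web", "database", run_query)` would record a duration without the web team writing any timing code, where `run_query` is whatever function performs the call.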
Priyanka Raghavan 00:19:56 Nice. Very cool. Maybe I can switch gears a little bit into what kind of support for infrastructure you provide. I’m thinking in terms of how you integrate networking with CI/CD. Could you talk us through that? And maybe, as a maintainer of Atlantis, do you use that a lot as well?
Luke Kysow 00:20:18 They’re connected in some ways and disconnected in other ways. When we’re talking about deploying the service mesh itself, that is more of a static thing: you’ll deploy a specific version of Consul, and that will run for a decent amount of time in your infrastructure. Maybe you’ll update to a new release every couple of months to get the latest bug fixes or whatever. You could pick your CI/CD pipeline of choice for that deployment; probably it would be whatever you use for your other services. Consul can run in Kubernetes and it can also run on virtual machines, so that’s how you would do that. And then when we look at CI/CD for our applications themselves, obviously you can just deploy those as you would before. The one difference with the service mesh is that if you’re running, for example, in Kubernetes, what we can do is intercept that request to, hey, deploy my web service.
Luke Kysow 00:21:17 And we can automatically add this proxy to it. So when that web service comes up, it’s now got that proxy already running, and that proxy is already connected to the Consul control plane, and boom, now that web service is in the service mesh and we can start to program its traffic and do certain things with it. But it can also come into play when we are deploying a new version of an application through our CI/CD pipeline. Maybe we want to say, we have a hundred instances of it, and we want to deploy 10 using the new version and do what’s called a canary deployment, where we route a very small amount of our traffic to those 10 new instances and closely monitor the metrics. This is where we can see the power of being able to program your network and do whatever you want with the routing. It’s going to be really easy to route only a small amount of traffic to those 10 instances when we have a service mesh and all of our calls are leaving through proxies across the infrastructure. So we can just program those proxies, saying, hey, if you’re talking to the web service, I want you to send 1% of your requests over to these canaries.
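The canary routing decision each proxy makes can be illustrated with a weighted random choice. This is a sketch of the idea only; the `pick_subset` helper and the subset names `stable` and `canary` are hypothetical, and a real proxy makes this decision per request in its own routing layer.

```python
import random


def pick_subset(weights, rng=random):
    """Choose where to send one request, proportionally to the given weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Roughly 1 request in 100 goes to the canary instances.
destination = pick_subset({"stable": 99, "canary": 1})
```

Changing the rollout from 1% to 10% is then a matter of changing the weights that the control plane hands to the proxies, with no change to the applications themselves.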
Priyanka Raghavan 00:22:28 So when you talk about programming your network, what’s the support? You’re saying that you can define a new configuration, I guess a declaration, right? How do you do that? I guess I’ll let you answer that.
Luke Kysow 00:22:41 Right, yeah. What we do is we have basically a configuration; in Kubernetes this is YAML, of course. Lots of different service meshes have their different ways to configure this. In Consul, we have a couple of different types of configuration. For instance, we have one called a service splitter, where you can identify, for example, that canary deployment, and you can say, hey, I want to send 10% of my traffic there. So what you would do is use a custom resource in Kubernetes that has that information, that says, hey, for requests going to the web service, I want you to route 10% of them over to the V2 of it. If you’re running on VMs, it’s the same kind of thing, but instead of having Kubernetes YAML, we can do it via an API call to Consul with either JSON or HCL, but it’s basically the same schema. And so, yeah, it’s all controlled via an API, or in Kubernetes it would be a YAML resource.
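For the JSON form mentioned here, the sketch below builds a service splitter entry as a Python dictionary and prints it. The `service_splitter` helper and the subset names `v1` and `v2` are assumptions for illustration (subsets would have to be defined elsewhere, for example in a service resolver); the `Kind`, `Name`, `Splits`, `Weight` and `ServiceSubset` field names reflect Consul’s config entry schema as I understand it and should be checked against the current reference.

```python
import json


def service_splitter(service, weights):
    """Build a service-splitter config entry sending traffic to named subsets."""
    # Split weights are expected to add up to 100 (percent of traffic).
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("split weights must add up to 100")
    return {
        "Kind": "service-splitter",
        "Name": service,
        "Splits": [
            {"Weight": weight, "ServiceSubset": subset}
            for subset, weight in weights.items()
        ],
    }


# Send 10% of requests for "web" to its v2 subset and the rest to v1.
print(json.dumps(service_splitter("web", {"v1": 90, "v2": 10}), indent=2))
```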
Priyanka Raghavan 00:23:40 Okay, perfect. I was going to ask you about A/B testing support, and you’ve kind of alluded to it now. So this is the service splitter, right, which you just talked about. Can you explain that a little bit? Is that something in the configuration that you set up in your YAML file?
Luke Kysow 00:23:58 Yeah, for sure. It’s exactly that. A/B testing is an interesting concept in terms of service mesh, because typically when I think of A/B testing, I think of a consumer, like a user using my website. For instance, in one case I serve a button that is blue and in the other case the button is red, and then I have metrics around which one gets clicked more often, right? Within something like the web service calling the database service, there’s probably less use for the concept of an A/B test, but you definitely have this concept of needing to split traffic, maybe for a blue-green deployment or a canary deployment. So what you would have is this configuration entry where it would identify that I want to send a certain amount of traffic to the service, 50% for example. And then I think what you could probably do too is identify separate metrics for the requests that went over there. You could see on your graphs the traffic that went to the blue button, and that, you know, there were more requests over there, something like that. So you could configure the metrics that get output there.
Priyanka Raghavan 00:25:04 Okay. Perfect. Sounds good. I got my answers, so that was great. I would like to now switch gears. We’ve talked about service discovery, a little bit about health monitoring and infrastructure support, and now my personal favorite, security. You talked a little bit about it in the beginning, but can you explain how you handle security in terms of authentication and authorization?
Luke Kysow 00:25:30 For sure. Yeah. And I’ll preface this discussion with what we’ve seen, and I’m sure it’s familiar to anyone who’s been a security engineer at a company that moved to a platform like Kubernetes. What we’ve typically seen for security is basically that you have a firewall, you have security groups between your apps, and you have this kind of control: okay, this IP address 10.10.10.10 can talk to, you know, IP address 10.10.0.5 or whatever. I have this rule and it’s listed down, and if I ever want to change it, I put a ticket in to operations and it gets changed, right? Very, very static security. We’re moving into, and really we’re already in, this new world where it’s too dynamic for that to work. In a scheduler, you’re going to have these pods with IP addresses that change all the time as they come up and down. You can’t have these static IP address lists, right?
Luke Kysow 00:26:23 What we need to do is focus on the idea of service identity. So it’s not this IP address and port pair that identifies a service; instead, that service needs to have some other sort of identity that we can use to authenticate it. So when the web service calls the database service, how do we know that it’s the web service? We can’t look at the source IP anymore, because that’s changing so fast. In addition, in places like Kubernetes, by default, unless you’re running with something like network policy, the network is totally flat. There are no firewall rules between what can talk to what; by default it’s completely open. Anything can talk to anything else, which developers love and security folks are probably pretty scared about. So what we talk about in that case is this idea of a zero trust network, where you can’t base your trust in who can talk to whom on IP addresses.
Luke Kysow 00:27:21 So what do we use? What most service meshes will do is actually use mTLS, mutual TLS. What that means is that when the web service talks to the database service, it speaks over SSL, so HTTPS. For instance, when I talk to google.com, my browser will make sure that the certificate served is actually served and owned by google.com, so I can’t get what’s called man-in-the-middled. What mutual TLS means is that not only do I verify that google.com is google.com, but Google will verify that this is Luke who is making this call. So when we go to our web and database example, the web service gets a certificate that actually identifies it as web. It’s cryptographically verifiable; it’s been signed by a trusted authority that says, yeah, this is the web service.
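The “mutual” part of mutual TLS comes down to the server also demanding and verifying a client certificate. Here is a minimal sketch using Python’s standard `ssl` module; the `server_context` helper and its file-path parameters are hypothetical, and in a service mesh it is the proxy, not the application, that holds the certificates and performs this handshake.

```python
import ssl


def server_context(ca_file=None, cert_file=None, key_file=None):
    """Create a TLS server context that requires a verified client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Plain TLS servers do not ask for a client certificate; mTLS requires one.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # The trusted authority that signs service certificates (path is an assumption).
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # This service's own identity, presented to callers.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

With a context like this, a connection from a client that cannot present a certificate signed by the trusted authority fails during the handshake, before any application data is exchanged.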
Luke Kysow 00:28:13 And so when that request goes over to the database service, not only can the web service verify, okay, I am talking to the database, I’m not sending my database password somewhere incorrectly, but the database service can look at the client certificate, which is coming from web, because that’s the client. It can see that it’s been signed by the trusted authority and that it actually is the web service. So the first layer there is this idea of mutual TLS: we both know who each other is, and we both trust that each other is who we say we are. And then, in Consul, we have what we call intentions. I’m sure there are many different names for it in the different service meshes, but it’s basically just an access control list, which says, okay, I know this is the web service talking to me.
Luke Kysow 00:28:59 Are they allowed to talk to me? And so we’ll look it up in our list, and it says, okay, yes, web is allowed to talk to me. You can go even deeper with Consul. We have this concept of layer seven, where basically you can look at the HTTP path that is being used. So, okay, the web service is allowed to call the payment service at /billing with a GET request, but it can’t do a POST request to /admin or something like that. So you can get even deeper into these authorization and security layers, but the overarching concepts are this idea of mTLS, where all communication is encrypted and over SSL, and that each side that is party to the communication can verify that they’re talking to who they expect to be talking to.
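The access-control check described here, including the layer-seven refinement on method and path, could be modelled as below. This is a conceptual sketch with a hypothetical `Rule` type and a first-match, default-deny policy chosen for illustration; it is not how Consul stores or evaluates intentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    source: str                        # verified identity of the caller
    destination: str                   # service being called
    allow: bool
    method: Optional[str] = None       # None matches any HTTP method
    path_prefix: Optional[str] = None  # None matches any path


def is_allowed(rules, source, destination, method, path):
    """Return the decision of the first matching rule, denying by default."""
    for rule in rules:
        if rule.source != source or rule.destination != destination:
            continue
        if rule.method is not None and rule.method != method:
            continue
        if rule.path_prefix is not None and not path.startswith(rule.path_prefix):
            continue
        return rule.allow
    return False


rules = [Rule("web", "payment", allow=True, method="GET", path_prefix="/billing")]
```

The `source` here is the identity taken from the verified client certificate, which is why this check only makes sense once mutual TLS has established who is calling.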
Priyanka Raghavan 00:29:43 Okay. Actually, I was going to ask you about zero trust, because we did an episode, 385, on zero trust. I think you’ve explained it beautifully, but one line really stuck with me from that episode: the host was actually quoting a Google engineer who said, do not trust your network, it is probably already owned. I think you’ve answered it perfectly, because if I were to summarize what you said, you really don’t trust any of the new nodes that are coming up. I mean, you basically don’t trust anyone, and then it all depends on your certificates and this mTLS that you talked about. And also, you said you have further granularity, right?
Luke Kysow 00:30:27 Yeah. I think zero trust networking sounds a bit confusing, maybe. Really, what we mean is you can’t use firewall rules based on IP address anymore. That’s really what it is. It’s like you’re running all your networks on the public network, on EC2 public or whatever it used to be called. Anybody can talk to you, and you can’t just set up firewall rules. Obviously you still have defense in depth and you should actually have firewall rules, but that’s generally the concept there.
Priyanka Raghavan 00:30:58 Right. I guess you can’t really just trust that a perimeter defense is good enough. Exactly. Let’s also get back to what you talked about, this multi-platform service mesh. So, for example, in my company, we probably have a Kubernetes cluster running on Azure and we do have some legacy VMs running some important jobs, right? There are some VMs in our data center, but we also have some VMs on another cloud provider. So how does Consul handle this situation?
Luke Kysow 00:31:42 Yeah. This is something that is extremely common. It’s very, very difficult to handle, at least in the world before service mesh, and we try to make it a little bit easier with the service mesh. So what we have with Consul is the concept of different data centers; however, you can also think about them as different clusters. They’re really kind of interchangeable there. When you have your Azure Kubernetes cluster running and you have your virtual machines running, ideally, first of all, you want that zero trust networking. So when, say, the web service that’s running over in your Azure cluster or your EKS cluster wants to talk to the database that’s running on-prem, what you’d call legacy, you want that request to be encrypted, obviously, as it goes over there.
Luke Kysow 00:32:33 So that’s the first thing. The second thing is that it’s actually not that easy to route between the two of them, because often when we’re talking about multiple different data centers or clusters, the networking spaces aren’t overlapping. And so you can’t just take the pod IP and talk to it from your on-prem data center. That pod IP doesn’t exist there. It doesn’t know where to route that, right? When we’re talking about multiple clusters, we need to add in this concept of gateways. Gateways sit on the edge and they have an IP address that is routable from either cluster. So maybe it’s a public IP, or maybe it’s still private, but it’s not a pod IP, you know, it’s something that actually exists that you can resolve. So that’s kind of the first key piece.
Luke Kysow 00:33:25 You need this thing that sits on the edge that is routable. Obviously, to get somewhere, it needs to be routable. We also have the concept of VPNs, where everything is within the VPN, so everything is routable. In that case, you probably wouldn’t need this gateway, but when you’re talking about a VPN, we’re going back to what we were talking about with perimeter security, right? We can’t just blindly trust our VPN. So that’s the first thing we need when we’re talking about multi-data-center. The second thing we need is to be able to communicate what’s running where, and whether it’s healthy or not. That seems pretty easy on the face of it: just make an API call to find out. But one of the reasons we’re running multiple data centers may be for failover and reliability.
Luke Kysow 00:34:08 So what happens when this data center goes down? This mechanism of understanding what’s running in each data center needs to be really resilient to failure. Those are the main concepts at play here. There’s the routability, usually through gateways. There’s the security aspect of it: you say that you’re the database, so how can I trust that you actually are the database? And then there’s the reliability and failover aspect of it. Concretely, what that means in Consul is we would have you set up a Consul data center, which is basically just a set of Consul processes, running in your EKS cluster, installed through the Helm chart. And then you’d also set up a set of Consul servers in your on-prem cluster, your legacy cluster, and those would be connected and would trust each other through a set of shared certificates.
Luke Kysow 00:34:57 And then, depending on how your networking is set up, we would have some gateways on the edge so that we could actually call between the two data centers. And then when the web service running in Azure wants to talk to the database, it just talks to its local proxy, and that proxy handles it from there. Under the hood, it’s going to be using a certificate. It’s going to recognize that the database isn’t in Azure, it’s actually over in this other cluster, this legacy cluster, and it’s going to send the request over to the gateway for that legacy cluster. The gateway is then going to look at the request. It’s going to see that it’s signed by a trusted certificate, it’s going to let that request through, and then it’s going to be routed directly to the database’s proxy. Then that proxy is going to look at the request and at the certificate. It’s going to say, okay, it’s from the web service and web is allowed to talk to me, and it’s going to route it over to the database. So under the hood, it’s actually quite complicated, but from the perspective of each cluster it’s kind of easy. You just talk to your proxy and everything gets handled from there.
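The routing decision the local proxy makes in this walkthrough can be sketched roughly as follows. This is only an illustration of the idea, not how Consul or Envoy implement it; the `CATALOG` and `GATEWAYS` tables, the addresses in them, and the `next_hop` function are all hypothetical.

```python
# Hypothetical catalog: which data center each service runs in.
CATALOG = {"web": "azure", "database": "legacy"}

# Hypothetical gateway addresses, routable from either data center.
GATEWAYS = {
    "azure": "gateway.azure.example:8443",
    "legacy": "gateway.legacy.example:8443",
}


def next_hop(local_dc, destination):
    """Decide where a proxy in local_dc should send traffic for destination."""
    destination_dc = CATALOG[destination]
    if destination_dc == local_dc:
        # Same data center: talk straight to the destination's own proxy.
        return f"{destination}-sidecar-proxy"
    # Different data center: go through that data center's gateway,
    # which forwards the still-encrypted request to the destination's proxy.
    return GATEWAYS[destination_dc]
```

In this sketch, the web service's proxy in the Azure cluster would call `next_hop("azure", "database")` and get the legacy gateway's address back, while the application itself only ever talks to its local proxy.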
Priyanka Raghavan 00:36:00 Okay. It probably sounds easier also because your application is just talking to your local proxy.
Luke Kysow 00:36:06 Yeah, definitely. Each application doesn’t really need to know where other applications are. So say you were to migrate that database somewhere else later. You could just do that, and you don’t really have to make any changes to the web service. You don’t have to redeploy or anything.
Priyanka Raghavan 00:36:21 Okay. I also wanted to find out from you if you could elaborate a little bit on failure and retries, because I think all this distributed communication needs infrastructure support for this scenario. Right? Can you go over that again?
Luke Kysow 00:36:37 Yeah, for sure. So as we saw the move from monolith to microservices, what used to be a function call was now actually a network call, and occasionally that would fail. So what we used to do, at least what I did, was build some for loops into our code where, if a call fails, just retry it. Basically you end up writing a bunch of logic in your application that handles these failures. And then somebody would get fancy and implement exponential backoff or something like that. Right? But each application developer needed to do this themselves, in every single application, for every single network call. At least what we found was some of them would have something where they’re like, well, we really need this call to succeed.
Luke Kysow 00:37:23 So we’re going to retry it 20 times. What would happen is that the service they’re calling maybe gets a bit slow. And so maybe that first call doesn’t succeed, and suddenly the service is getting hammered with 20 times the traffic because this application, in its wisdom, is just retrying the call a ton of times. And now that service is actually hard down because it’s getting 20 times as much traffic as it used to. So that’s setting the stage for where we started to go as an industry. There were libraries out there that would be smarter about it, but those were the problems we had to solve. And so what we get with the service mesh now is: don’t write a for loop in your application that tries this network call 20 times. Instead, try it once and send it out to the proxy.
Luke Kysow 00:38:12 And now inside the proxy, we can control that loop of 20. Maybe we do want it to loop 20 times, but maybe the service is down and we actually only want it to loop once, or to wait five seconds before it tries again. Now all this logic gets put into the proxy and then we can configure it. Similar to what we were talking about before with the service split, we would have a configuration file, or a YAML file, which basically configures the behavior for failover. And I can go quickly over the different things you could do. One thing you can do is retry, and you could retry with an exponential backoff. You could do a circuit breaker pattern, where you recognize that that service is hard down, and so instead of making the application that’s making that call wait for, like, one second until the call times out, you just fail it right away.
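As a rough illustration of the retry logic that moves out of each application and into the proxy, here is a minimal sketch of a bounded retry with exponential backoff. It is not taken from Consul or Envoy; the function name `call_with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be swapped out or observed.

```python
import time


def call_with_retries(call, max_attempts=3, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Run call(), retrying on ConnectionError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... but never longer than max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Keeping `max_attempts` small and spacing the attempts out is what avoids the situation described earlier, where a struggling service is suddenly hit with 20 times its normal traffic.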
Luke Kysow 00:39:02 So immediately that call fails. The application still needs to deal with the fact that the call failed, but it doesn’t have to wait one second, which might mean a user gets an error page really quickly instead of having to wait for your call to time out. And another thing we can do, since we did just talk about multi-cluster, is look at failing that request over to one of these other clusters that we think is still up and running. So ideally we only talk to our local cluster because it’s faster, but in times of trouble, we might fail over to this other cluster over here. The call takes a bit longer, but at least it succeeds. That’s what we look at when we look at how we handle failure. And the only other thing I’ll add is that when we go back to observability, we now have a common set of metrics with which to diagnose what is failing. So in very complicated architectures, we’ll have every single proxy emitting a metric saying the database is down. That’s pretty darn good information that the database is down.
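The circuit breaker and failover ideas from this answer can be combined in one small sketch. This is a simplified illustration under stated assumptions, not the behavior of any particular proxy: the `CircuitBreaker` class, its thresholds, and the `primary` and `fallback` callables (standing in for the local cluster and a remote cluster) are hypothetical.

```python
import time


class CircuitBreaker:
    """Stop calling a failing primary for a while and use a fallback instead."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before the circuit opens
        self.reset_after = reset_after              # seconds to wait before a trial call
        self.clock = clock                          # injectable so time can be faked
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def call(self, primary, fallback):
        if self.is_open():
            return fallback()  # fail over right away, no waiting for a timeout
        try:
            result = primary()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, the primary is not called at all, which is the "fail it right away" behavior described above; the fallback here plays the role of the slower but healthy remote cluster.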
Priyanka Raghavan 00:39:57 Yeah. It’s just got me thinking about a lot of stuff. Exactly. But coming back to the episode, this is great. If I were to recap what we’ve just spoken about, you’ve led us through service discovery and health monitoring, and right now we spoke about this multicloud support and also these failure and retry scenarios. Can we step back and also talk a little bit about something I should have asked you in the beginning, but I got so into this: could you tell us how the control plane and the data plane work in Consul?
Luke Kysow 00:40:36 For sure. Yeah. So at a high level, when we think about the data plane, what we mean is each application calling the other applications through the proxies. So this is the data. It’s kind of hard on a podcast, but try to visualize in your head that on the bottom of the diagram we have application web talking to application database, and they’re talking through a proxy. That’s the data plane. And then above all that sits the control plane. So this is Consul or Istio, and it is talking to the proxies. It doesn’t talk to the application; it configures the proxies. Now obviously there’s a lot less communication going on here, but the control plane needs to tell the proxy, hey, split half of your traffic over here, or, hey, I want you to retry every one second or something for this. So that’s what we call the control plane. So the data plane is the applications talking to each other through the proxies, and the control plane is what’s controlling all the proxies.
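The split between the two planes can be pictured with a toy model. This is only a conceptual sketch, not Consul's or Envoy's API: the `Proxy` and `ControlPlane` classes and the configuration keys are hypothetical.

```python
class Proxy:
    """Data plane: sits next to one application and carries its traffic."""

    def __init__(self, service):
        self.service = service
        self.config = {}

    def apply(self, config):
        # The proxy only stores and acts on what the control plane sends it.
        self.config = dict(config)


class ControlPlane:
    """Control plane: never touches application traffic, only configures proxies."""

    def __init__(self):
        self.proxies = []

    def register(self, proxy):
        self.proxies.append(proxy)

    def push(self, service, config):
        # Send the same configuration to every proxy fronting the given service.
        for proxy in self.proxies:
            if proxy.service == service:
                proxy.apply(config)
```

For example, after `control.push("web", {"retry_interval_seconds": 1})`, every registered proxy for the web service would hold the new retry setting, while proxies for other services are left unchanged.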
Priyanka Raghavan 00:41:41 Perfect. The next logical thing, which I’ve not covered yet, is performance. How does that work with having this extra layer with Consul? Do you do something special for clients with performance-critical applications?
Luke Kysow 00:42:01 Yeah, that’s a great question, and I think that is something that’s often missed when we talk about the wonders of service meshes. There can be performance implications, because you’re taking this request that used to leave your application and go over to this other one, and now it’s hitting some other process, which has its own memory and needs CPU cycles to process it. Then maybe it’s doing some complicated things. Maybe it’s doing SSL, right? So it’s definitely not a free lunch. If you’re talking about extremely low-latency applications where microseconds matter, then adding a proxy in front of everything is definitely going to make an impact on your application. And there are other developments around eBPF, this idea where, why don’t we do a service mesh, but we do it inside the Linux kernel itself?
Luke Kysow 00:42:54 So it’s a lot faster, because we don’t have to swap back into userspace. So there are developments for these extremely low-latency requirements, but I think what’s worth mentioning is that most applications don’t notice microseconds. They’re going to notice a couple of milliseconds, but a couple of microseconds are not going to make a big difference. And instead, what makes a big difference is being able to build applications faster, deploy them, migrate them, control them and handle failure faster. So for many of the microservices use cases, the extra performance hit you get from a service mesh is well worth all the control you’re getting. And in some cases it’s actually faster, because in the case of load balancing, we’re able to intelligently route to the application that’s closest to us, whereas before we were just blindly firing off to all the different instances available. You could actually see your request complete faster because you’re talking to the one that’s closest to you. There is a performance hit and it’s not negligible, but in general, what we find is that it doesn’t make a huge difference to these applications. We’re talking about maybe an extra couple of milliseconds, and the control it gives you is well worth it.
Priyanka Raghavan 00:44:10 Now that we’ve covered a range of things that Consul does, how would you differentiate it from the other service meshes out there, like Istio or Linkerd?
Luke Kysow 00:44:21 With Consul, what we’re very focused on is this idea of multiple platforms. Consul started out on virtual machines. A lot of the service meshes were Kubernetes-first and Kubernetes-only, and some of them still are. So that’s a big thing for us: how can we make the UX, the user experience of the service mesh, and how you deploy it, the same across multiple platforms, be it Kubernetes or ECS or VMs? So I’d say that’s what we focus on as our differentiator. And then we also focus on tie-ins to other HashiCorp products. So for instance, if you’re using Vault, you can use Vault to control some of the security of the Consul service mesh. So what we think of as our special sauce there is multi-platform.
Luke Kysow 00:45:17 And then we also have a program called the HashiCorp Cloud Platform, where you can actually get a hosted version of Consul, so you don’t have to manage all the operational complexities of running your own service mesh. That’s another really big initiative for us, where we have our own team of SREs to manage this for you. And again, because we’re not owned by a specific cloud provider, we can provide this across different clouds and we can provide multicloud support. So that’s another thing that I think is worth calling out: we were always about multi-cloud and multi-platform, and we don’t really have a specific tie-in to a specific cloud. I think those are the differentiators for us there, but they’re all great. I think all of the service meshes are good, depending on your use case and what you’re doing, and I think all the folks working in the service mesh industry are doing really, really cool stuff.
Priyanka Raghavan 00:46:08 Okay. Okay. So can you suggest some good resources for our listeners to, you know, look up on Consul?
Luke Kysow 00:46:15 Absolutely. This podcast would be great, but you’re already listening.
Priyanka Raghavan 00:46:21 I liked that
Luke Kysow 00:46:23 I think the best place for folks to start is a really great website we have called learn.hashicorp.com. It’s all about hands-on tutorials for jumping into the tool, and at least personally, that’s how I like to learn. Of course, you can read all the documentation too, but I like to just play around with things. So I think that’s probably where I would tell folks to start: go to learn.hashicorp.com, look at the different Consul tutorials we have there, and just start walking through them. Get Consul installed and start messing around with it. I think that’s a really great place to start.
Priyanka Raghavan 00:46:56 Okay. I’ll make sure that I put that in our show notes. Okay. Uh, where can people reach you if they wanted to like touch base? Like somewhere on Twitter?
Luke Kysow 00:47:04 Yeah. The best place to reach me is on Twitter. My username is lkysow (L-K-Y-S-O-W). So, hit me up on Twitter. My DMs are open and I’m happy to discuss anything.
Priyanka Raghavan 00:47:17 Thank you so much for this interesting conversation. I feel very enlightened like Buddha.
Luke Kysow 00:47:23 Thank you so much for having me.
Priyanka Raghavan 00:47:30 Thank you. This is Priyanka Raghavan for Software Engineering Radio.
[End of Audio]