Alexis Richardson of Weaveworks discusses GitOps. GitOps is a deployment model for infrastructure and applications in which commits are approved and landed in a Git repository, and a deployment agent continuously applies the latest commit from the repository to production. Host Robert Blumen spoke with Richardson about the origins of the GitOps approach; the scale at which Jenkins jobs and build scripts break; the separation of the front-end process for landing commits from the back end for applying them; the concept of convergent infrastructure; applying infrastructure as code to deployments; how GitOps enabled Weaveworks to recover from a wipeout of their entire infrastructure; the Flux open source project; and how GitOps can help with complex regulatory compliance requirements.
- Mastering Kubernetes security and compliance with GitOps ebook, available from Weaveworks (free – requires you to fill out a request form)
- Practical Guide to GitOps ebook (free – requires you to fill out a request form)
- GitOps community
- What you need to know: guide to GitOps
- Dmitrii Evstiukhin The Why and When of GitOps
- Alexis Richardson What is GitOps, Really?
- GitOps – Operations by Pull Request
- Anita Buehrle A Practical Guide to GitOps
- Dmitrii Evstiukhin GitOps: How to Ops Your Git the Right Way
- GitOps FAQ
- Flux on GitHub
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
SE Radio 00:00:00 This is Software Engineering Radio, the podcast for professional developers, on the [email protected]. SE Radio is brought to you by the IEEE Computer Society and by IEEE Software magazine, online at computer.org/software.
Robert Blumen 00:00:22 For Software Engineering Radio, this is Robert Blumen. Today I have with me Alexis Richardson. Alexis is the founder and CEO of Weaveworks and the chair of the tech committee of the Cloud Native Computing Foundation. Prior to Weaveworks, he was the head of product for Spring at Pivotal. Alexis, welcome to Software Engineering Radio.
Alexis Richardson 00:00:46 Hi, Robert. Nice to be here. Thank you for inviting me.
Robert Blumen 00:00:49 We are going to be talking about GitOps. Weaveworks is one of the best information sites on the web for this topic, and there are a number of ebooks you can download there that we’ll link to in the show notes. Before we get to what GitOps is, Alexis, let’s talk about the problem domain that GitOps is a solution to.
Alexis Richardson 00:01:13 Great question. So fundamentally GitOps is a solution to the problem of management and deployment of applications in the cloud native space. Let’s say you want to change a running application with a new update. That might be different logic or a feature for the website or the backend microservice, or other more complex changes. Then you deploy that into the running system; that is, you update the running system with the new code. In cloud native, that usually means you’re deploying a container or a reference to containers. You also have configuration changes, which in the Kubernetes world are edits to YAML files. Typically these changes collectively form an update. You can have more complex changes. For example, you could imagine changing an application, a database schema, monitoring information, and dashboards all together. So updates can be transactional. You can also have side effects like firing Lambdas on Amazon.
Alexis Richardson 00:02:17 After you’ve completed an update. All of these things are forms of deployment. Something else that’s really important is that you can roll back. In GitOps land, rollback is usually a roll forward to a new update based on an earlier state, rather than a reversal of the operation. Another kind of update is when you put a security patch out, say for managing security in a cluster or in a platform running on the cluster. This is quite a common use of GitOps. Upgrades are also a form of deployment. So collectively you can see that lifecycle management, updates, and patches are all forms of deployment and management that GitOps addresses. One last thing about GitOps that I forgot to mention is the concept of drift correction, another kind of management. Let’s say that your app or your cluster drifts away from the correct state defined in the config. A good GitOps solution should correct it. In Kubernetes land, that means you tell Kubernetes, “Hey, you’ve drifted from your correct state. Please get back to the right one.” Kubernetes alone is not enough to do this. You need multiple tools to do this properly. Okay.
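The drift correction idea described above can be illustrated with a minimal Python sketch. It is not taken from Flux or Kubernetes; the function name `find_drift` and the flat dictionary model of "state" are assumptions made purely for illustration.

```python
def find_drift(desired, observed):
    """Return {key: (desired_value, observed_value)} for every key that differs.

    Both arguments are plain dicts standing in for the declared config and
    the live state; a key missing on one side is reported with None.
    """
    drift = {}
    for key in desired.keys() | observed.keys():
        want = desired.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example values: the config asks for 3 replicas, 2 are running.
desired = {"replicas": 3, "image": "shop:1.4.2"}
observed = {"replicas": 2, "image": "shop:1.4.2"}
print(find_drift(desired, observed))
```

An empty result means the running system matches its declared state; anything else is drift that an agent would be expected to correct.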
Robert Blumen 00:03:25 In your discussion, you’re covering what some people might think of as separate things, but in your description you’re treating them together: changes to infrastructure and a new release of the application code. Do you consider those to be the same problem?
Alexis Richardson 00:03:44 Technically yes, but they are very often handled by different teams in different ways and with different frequencies. What we see most commonly, especially in larger enterprise customers, where we do most of our work, is the separation between a platform team, sometimes called platform operations, or multiple platform teams, and an application dev team. So what are the differences between these? Let’s list a couple. The platform team is typically associated with central operations. They have made decisions, or are implementing decisions from another group, which set what the platform will be, on top of which applications will be deployed. You might have a machine learning platform, for example. Once those decisions have been made, updates are typically upgrades and patches rather than frequent changes. Whereas in app dev you might have lots of little machine learning apps that everybody’s writing all day long, and those might be updated once, twice, three times, a thousand, 10,000 times a day, depending on which organization you work at. And so yes, there is a significant difference between the platform team and the application development team in terms of how they work and the expectations and constraints.
Robert Blumen 00:04:58 Before we get into what GitOps is, there are existing solutions in this problem domain. What are some, or at least talk about one existing solution and what you don’t like about it?
Alexis Richardson 00:05:12 That’s a really tricky question to answer because it requires some pretty subtle knowledge of a lot of the systems. But I think that it’s fair to say that the concept of GitOps began to show its face in the 1990s, with the writings of Mark Burgess, when he wrote about some of what later became CFEngine. Subsequently a number of companies such as Microsoft tried out different things. There was a thing called the model-driven data center for a while at Microsoft, where the idea was that the whole data center was automated and you just had a model of it. That model lived somewhere, in a source control system perhaps, and the automation was operations done for you automatically, in a secure way, separate from the model. This is basically GitOps. A few years later, people started to use tools like Subversion and then Git much more commonly.
Alexis Richardson 00:06:08 And we saw the rise of DevOps and continuous delivery. Here I would recommend the writings of Jez Humble, Dave Farley, Andrew Clay Shafer, and others, around 2007, 8, 9, 10. There is even a quote in the Continuous Delivery book saying, and I’m paraphrasing, we imagine a world where all changes are handled autonomically by agents running inside the system itself, based on an external description in version control. That’s GitOps. So the idea pre-existed the modern tools. We started to see real progress here with Puppet and Chef, which are based very much on descriptions of machines, but were invented before containers and immutable infrastructure. Then the next generation was Terraform. Terraform for me is one of the first real modern GitOps tools. But one thing that Terraform lacks in its built-in state is automatic drift correction. It doesn’t tell you, Hey, I think I’m out of shape.
Alexis Richardson 00:07:07 I need to go back to the correct state for you before I tell you that something might be wrong. And now we have a new generation: Kubernetes; Flux, which originated at Weave; and Flagger, which does progressive delivery. There are some other technologies out there too. Just to list a few, Jenkins X, Tekton, and Argo CD are all examples of things that I’d say implement GitOps. I would say the principal difference between the modern generation and the ones I mentioned earlier is that they’re container native, they’re based on automatic drift correction, and they’ve got a much more complete description of the stack. As a result, they can do more for you. So to some extent it is a difference in implementation rather than a difference in concept.
Robert Blumen 00:07:53 What I take away from that is that what you’re calling the modern trend is what might be known by another name, infrastructure as code, where the code is the model of reality. You apply the model and try to bring reality in line with the model, and GitOps is yet another application of that. Did I get that correctly?
Alexis Richardson 00:08:15 Yes, you did. And that then raises the immediate question: well, what changed? And the answer is we’ve gone up the stack. So it’s not just infrastructure anymore. An example would be deploying a canary. Let’s say that I have a thousand clusters running and I’ve got 2,000 applications running on them. I want to deploy changes to that. I don’t want to take the system down. So I start using the so-called canary pattern, for example, where I might select one or two clusters and I might change one or two elements of the microservice, maybe using a feature flag or some other routing trick to guide some traffic, but not all, to the changed code. And then I observe the differences and I say, hey, is this new code running better for me than the old code? Am I happy with my new feature? And if I am, then I increase, using a dial, the amount of traffic that’s flowing to the new setup. And eventually all the traffic is running on the new setup. So we do this using GitOps today, and it works incredibly well. We can also do, as I mentioned earlier, things like patches up and down the stack and much more. So we’ve grown and enriched the depth, the breadth, and the detail with which we can do what used to be called infrastructure as code, which hence was only for infrastructure; now it covers applications as well.
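The canary "dial" described in this answer might be sketched roughly as follows. This is only an illustration in Python: `run_canary`, the step percentages, and the `is_healthy` callback are hypothetical names, and the sketch does not use Flagger's actual API or configuration format.

```python
def run_canary(steps, is_healthy):
    """Walk the traffic weight through `steps`; roll back to 0 if a check fails.

    steps: increasing percentages of traffic to route to the new version.
    is_healthy: callable taking the current weight and returning True when
        the observed metrics for the new version look acceptable.
    Returns the final weight: the last step on full promotion, 0 on rollback.
    """
    weight = 0
    for weight in steps:
        if not is_healthy(weight):
            return 0  # send all traffic back to the old version
    return weight


# Healthy at every step: the new version ends up with all the traffic.
print(run_canary([5, 25, 50, 100], lambda w: True))

# Metrics degrade once half the traffic arrives: the rollout is abandoned.
print(run_canary([5, 25, 50, 100], lambda w: w < 50))
```

In a GitOps setup the step list and the health criteria would themselves live in version control, so the rollout policy is reviewed and audited like any other change.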
Robert Blumen 00:09:36 Okay. So it’s an extension of the infrastructure as code model?
Alexis Richardson 00:09:42 Very much so. Yeah, I think it’s 95% infrastructure as code, and then the 5% on top of that has just opened up a whole new world of things that you can do. Okay.
Robert Blumen 00:09:52 So that is a very provocative statement. I want to come back to that. I think we do need to get to the big question here, which you’ve partially answered: what exactly is GitOps?
Alexis Richardson 00:10:05 GitOps is model-driven automation, to give it a simple name. But that’s kind of a fancy term. I would say it’s operations made simple for developers who don’t want to understand operations, because instead they just want to make changes to the definition of the app. And what that reflects is that we have better infrastructure tools for doing operations for us. Now we don’t have to be quite so hands-on about what’s inside our stack. I remember when I first drifted into the world of computing and IT, there were still lots of people building their own machines. Everybody would have their favorite box in the corner, which they’d assembled with their own motherboard and who knows what else, and even their own customized version of Linux. There would be many debates about what was best. Nowadays I think the number of people doing that has shrunk, although there are probably more people doing it with drones, I guess, or something else, because frankly we don’t need to spend time dealing with how to assemble servers anymore. And soon we won’t want to spend time dealing with server infrastructure anymore. We’re almost at that point.
Robert Blumen 00:11:12 What do you mean by that? Are you referring to Kubernetes and containers as the substrate?
Alexis Richardson 00:11:18 They are the current wave. I think there’s going to be more in the future, I expect. What Kubernetes and containers are doing is providing a sort of hybrid between an application or web server on the one hand and a distributed operating system on the other, to give you a set of APIs to which you can write applications, which can then be managed for you by somebody who knows how to run Kubernetes. And that somebody could be Google GKE or Amazon EKS or Azure or someone else, or it could be that your platform team has set up a farm of 20 clusters for you. Either way, as an app developer, I don’t need to care about that anymore. And it’s getting simpler every day. So for me, I think the next objective should be that switching on a cluster and using it is like turning on your phone.
Robert Blumen 00:12:07 So based on my reading on this topic, I don’t think we’ve gotten a complete answer to what GitOps is. I am going to proceed on this by asking you about the four laws of GitOps, the first being that the entire infrastructure is kept in Git and that Git is the single source of truth. Those are the first two of the four laws. Would you like to speak to that?
Alexis Richardson 00:12:35 Okay, well, I didn’t write the four laws. That was Cornelia, who wrote those recently. Let’s see if we can go through them one by one. The first thing is that everything is described declaratively and lives in Git. Let’s talk about that, because I think that Git here is a little incidental. It’s just that Git is the modern, paradigmatic, best-case version control system that we use. And let’s not pretend that “Subversion ops” would have been as good a name. But at the same time, we’ve actually worked with many customers for whom using Git a hundred percent of the time is impossible. If you look, and I can’t remember whether it’s in the Flux GitHub organization or one of the Weaveworks orgs, you’ll see something called libgitops, which lets you do GitOps-type operations as if on Git, but actually using alternative back ends like S3 or etcd.
Alexis Richardson 00:13:37 This is useful when you have complex workflows that require you to put things in buckets before they can land in Git. So actually the key concept here is that your model is under version control and fully declarative. And then it turns out that Git is a very good place to do all those things. It also has really nice security properties compared to some of its predecessors, which means that you can be pretty sure about who’s changed what in your application, if you use Git security properly. It also has a number of good competing implementations. We have customers on GitHub, GitHub Enterprise, GitHub for business, GitLab, GitLab CE, GitLab EE, Gitea, do-it-yourself Git, Mercurial, Atlassian. I mean, rather than Mercurial, sorry, Bitbucket, as it was. That was the one I meant. So the full range of different implementations is there, and that means you can have very high availability, reliability, and extra security features if you need them. So all of that hangs off the back of the concept of Git being relevant.
Robert Blumen 00:14:52 Your point being that “Git” is very easy because it’s one syllable to pronounce, but the concept is that your infrastructure lives in some kind of version repository.
Alexis Richardson 00:15:04 I think choosing between Mercurial, Subversion, and Git isn’t just about one being easiest to use because it’s monosyllabic and very popular. It’s also because they aren’t identical in terms of features, and Git seems to be the one that has the most capability and supports the widest range of use cases that we need. So it’s that reason, plus the other one, which is sort of popularity and syllables.
Robert Blumen 00:15:33 Yeah. So that’s pretty good coverage of the first two of the four laws. I’m ready to move on to the third one, which is about pull requests. Would you like to cover that?
Alexis Richardson 00:15:47 Yeah, sure. So approved changes can be automatically applied to the system. There are several underlying concepts here that we can surface. Let’s talk about approved changes first. Approval means the potential for a human being, or maybe an agent acting on behalf of the human being, to take part in some sort of more-than-one-step workflow. This reflects the fact that for many people doing management and deployment, or even releases into a running system, all of these things typically involve a consensus between different actors responsible for different elements of production, such as the application coder, their buddy, and maybe a manager who needs to approve a change. I spoke to a really large bank in Australia that was automating all of their continuous delivery workflows around mobile banking apps using Kubernetes, using GitOps, using other tools, and the cloud. And they had to do one extra thing, which was that they had to file tickets in Jira.
Alexis Richardson 00:16:53 Every time a change was made, that would then cause an email to be sent to the marketing team. Somebody from marketing had to go into Jira, look at a screenshot of the new application change, and approve that. And that was a manual step that was part of their flow. They also wanted to make it possible that, if they were concerned about security, they could pull the big emergency security lever and suddenly a whole lot of additional policy gateways would be put into the pipeline so that checks could be made. So not everybody wants full automation, and we need to have room for manual approval. And again, one of the things that’s nice about pull requests is that you can have people approve them both technically, with somebody else merging your thing, and also with the whole social side around LGTM, “looks good to me.”
Alexis Richardson 00:17:41 And all of the metadata around that is very important for recording in your audit trail what happened to the application, when, and why this change was made. And then the automation part. It’s very important to understand that the way GitOps works is that you’re running agents inside the cluster, which can automatically apply changes once they are aware of them. You can also tell them to be manual if you want, so you can force a synchronous change. But typically what we see is that people let the so-called reconciliation loops run inside the cluster, and they see the potential changes landing in repos. And then they go, ah, there’s a change here, I need to apply a change. That change can typically be an application update for a deployment, and they go, now I’m going to load up the new container and change the config in the running system. And then, hey presto, the running system will converge to the new state. So that automation is key.
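As a rough illustration of the pull-based agent described above, here is a small in-memory Python sketch. `FakeRepo` and `Agent` are hypothetical stand-ins invented for this example; a real agent such as Flux talks to an actual Git repository and the Kubernetes API, neither of which is modelled here.

```python
class FakeRepo:
    """Stand-in for a Git repository: the latest approved revision and config."""

    def __init__(self):
        self.revision = 0
        self.config = {}

    def merge(self, config):
        """Simulate an approved pull request landing on the main branch."""
        self.revision += 1
        self.config = dict(config)


class Agent:
    """In-cluster agent that pulls from the repo and applies what it finds."""

    def __init__(self, repo):
        self.repo = repo
        self.applied_revision = 0
        self.running = {}  # stand-in for the live cluster state

    def reconcile_once(self):
        """Apply the repo's config if a newer revision has landed.

        Returns True when a change was applied, False when already up to date.
        """
        if self.repo.revision == self.applied_revision:
            return False
        self.running = dict(self.repo.config)
        self.applied_revision = self.repo.revision
        return True


repo = FakeRepo()
agent = Agent(repo)
print(agent.reconcile_once())          # nothing has been merged yet
repo.merge({"image": "shop:1.4.3"})    # an approved change lands
print(agent.reconcile_once())          # the agent notices and applies it
print(agent.running)
```

In practice `reconcile_once` would be called on a timer or in response to a webhook. The point of the design is that nothing outside the cluster pushes changes in; the agent pulls them.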
Robert Blumen 00:18:37 If I’ve understood all of that, we have a development process that the organization would define. It might include things like peer reviews, different groups having to sign off, and change ticketing. That process constitutes approval. You’re a developer, Alexis; your change makes it through approval. It’s now approved and lands in Git. Then there’s an agent on the other side, which says, aha, I see a change. I, the agent, am going to apply that change to production. Did I get that?
Alexis Richardson 00:19:08 You did. And that’s a simple example. You could have another example, one level more complex, where the changes get merged and approved and pushed into Git, and now they’re ready for deployment. So you’ve got a new config file and you’ve generated some new images. Now, before you do a deployment, you might do some kind of preflight check, which could be an automated approval step, which could be a policy check on the YAML. It could just check that it’s well formed, or it could check that it doesn’t have any non-permitted words in it, because you might be looking for certain signs that a security breach has happened. You can even do post-deployment runtime checks. So what’s a good example of a pre-deployment check? You do a CVE check on your image. JFrog Artifactory would be a common image hub for this. Check that it has no CVEs before I deploy it. That’s a policy check representing an approval that can be automated. Post-deployment, you’ve probably seen that in the world of Kubernetes, where we lean into automation, we can start containers and then they can thrash because there’s something wrong in the container. You actually want to notice there’s something sick about your container, shut it down, and go back to where you were. So you can also have post-deployment checks that look for bad behavior after the deployment. All of these things are elements of an approvals process associated with a deployment.
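The pre-deployment checks mentioned in this answer (non-permitted words in the manifest, known CVEs on an image) could be sketched along these lines in Python. Everything here is an assumption for illustration: the function names, the forbidden-word list, and the `cve_db` lookup table are hypothetical, and no real scanner or registry API is being called.

```python
def no_forbidden_words(manifest_text, forbidden=("password", "secret_key")):
    """Return one finding per forbidden word present in the raw manifest text."""
    lowered = manifest_text.lower()
    return [f"forbidden word: {word}" for word in forbidden if word in lowered]


def no_known_cves(image, cve_db):
    """Return one finding per CVE the (hypothetical) lookup table lists for the image."""
    return [f"{image}: {cve}" for cve in cve_db.get(image, [])]


def preflight(manifest_text, image, cve_db):
    """Run all checks; an empty list means the change may be deployed."""
    return no_forbidden_words(manifest_text) + no_known_cves(image, cve_db)


# A clean manifest and an image with no recorded findings passes.
print(preflight("image: shop:1.4.3", "shop:1.4.3", {}))

# A manifest with an inline password and a flagged image is rejected.
findings = preflight(
    "password: hunter2",
    "shop:1.4.3",
    {"shop:1.4.3": ["CVE-EXAMPLE-1"]},  # placeholder identifier, not a real CVE
)
print(findings)
```

Returning a list of findings rather than a bare boolean keeps the reasons for a rejection available for the audit trail.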
Robert Blumen 00:20:32 Those approvals, do they take place on the front end of the process, as part of the definition of what constitutes an approved change, or does the agent do those after the change has landed in Git?
Alexis Richardson 00:20:46 Usually it’s actually done by the agent on your behalf, typically. So in terms of implementation, if you’re running Flux as an agent, it will do some of these preflight checks for you, for instance.
Robert Blumen 00:20:53 Now the agent is going to try to apply this change. Can that fail? What are some of the failure modes, things that can go wrong on that step?
Alexis Richardson 00:21:09 That’s an interesting question. The most common form of failure that we see is partial failure, and that occurs when you have a multitude of changes, each one of which is, let’s say, atomic, and some of them succeed and some of them do not. So the overall transaction fails: it’s not atomic, because it’s only partially correct. One thing that you used to have to do, if you were trying to drive changes from, say, an external orchestrator like Jenkins or something, is that you weren’t sure what state you were in, because you weren’t basing your deployment on comparing the desired state with the observed state, but instead just on a list of actions. You would have to go back to the beginning and run your whole deployment again from the start. Sometimes you’d even trash the entire cluster, go back and redeploy the cluster, and just do everything clean from the beginning.
Alexis Richardson 00:22:02 And that would be a large, time-consuming operation that could take 30 minutes, 40 minutes, 50 minutes, or more, and lead to everybody having to down tools while it was being done. With GitOps, one thing that’s quite cool is that some of these failures get picked up during the process of deployment, because the loop is running continuously. And the agent says, wait a minute, that last change that I made didn’t go through correctly. I’m just going to keep telling Kubernetes to converge onto the correct state. And it’ll continue to iterate forward, pushing the running system to the intended state for you without firing an error. And you can set up different constraints and parameters so the system lets you know when it’s having trouble like that. So that’s a pretty common source of failure. Another pretty common source of failure is when you’ve got a system that’s in the process of crashing and you’re trying to fix it.
Alexis Richardson 00:22:57 So, you know, we have a 24/7 running SaaS called Weave Cloud, and we deploy everything in it using GitOps, including a really cool tool called Cortex, which is a multi-machine Prometheus farm. So it’s a pretty complex, data-oriented Kubernetes application. We sometimes see congestion or delays or packets being dropped as metrics come into our Prometheus-as-a-service piece of Weave Cloud. And then we start deploying changes to fix that using GitOps. And if you try to do that at the same time as the world is collapsing because your system’s getting overloaded and you’ve got cascading failures, then you can sort of imagine that it’s like a disaster movie at that point.
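The contrast drawn in this answer, between replaying a fixed list of actions and repeatedly converging on a desired state, can be sketched in Python as follows. The `converge` function and the `flaky_apply` callback are hypothetical names for illustration only, and the sketch handles only keys that appear in the desired state (it does not remove extras).

```python
def converge(desired, observed, apply_one, max_rounds=10):
    """Re-apply only the keys that still differ until observed matches desired.

    apply_one(key, value) returns True on success and False on a (possibly
    transient) failure. `observed` is updated in place.
    Returns the number of rounds in which changes were attempted.
    """
    for round_number in range(1, max_rounds + 1):
        pending = {k: v for k, v in desired.items() if observed.get(k) != v}
        if not pending:
            return round_number - 1
        for key, value in pending.items():
            if apply_one(key, value):
                observed[key] = value
    if all(observed.get(k) == v for k, v in desired.items()):
        return max_rounds
    raise RuntimeError("did not converge")


failed_once = set()

def flaky_apply(key, value):
    """Simulated apply step: the image update fails on its first attempt only."""
    if key == "image" and key not in failed_once:
        failed_once.add(key)
        return False
    return True


observed = {"replicas": 2}
rounds = converge({"replicas": 3, "image": "shop:1.4.3"}, observed, flaky_apply)
print(rounds, observed)
```

Because each round starts from a fresh comparison of desired and observed state, a partial failure in one round costs only a retry of the pieces that did not land, not a full redeployment.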
Robert Blumen 00:23:38 For this to work, you rely on a property that is very much discussed in the Weaveworks blog posts, called the convergence mechanism. Yes. I think you’ve referred to that just slightly in your last discussion, but could you delve into that a bit more?
Alexis Richardson 00:23:56 Sure. So this concept of convergence, which is also called reconciliation, is based on the idea, and I think for me this is the fundamental, most important piece of automating operations, that you have the ability to compare the model, the so-called desired state, with what’s actually happening right now, which is the observed running state. And if they are the same, then your system says, hey, I’m happy. I’m in my desired state. My desired state was to be sitting in my chair with my desk in front of me talking to Robert; I’m happy now. Or my desired state is: I’m a Kubernetes cluster, I need three machines, 40 containers, 10 routers, RabbitMQ, Tomcat, something in Ruby, something else. On the other hand, what if you’re not in your desired state? We would call that a diff. So there’s a difference between some aspects of the model and some aspects of the running state. That could be the number of machines, what’s on the machines, the cluster, the size of the cluster, the Kubernetes version.
Alexis Richardson 00:25:00 You could have a different version of Kubernetes from the one you desire. It could be how many nodes you’re on, how the network is configured, what the network rules are, which applications are deployed, how they can communicate with each other, whether they’re using the latest container for the application, and how the traffic rules are set up. And so everything from top to bottom, dashboards too, all of these things can be compared. Your entire world can be compared between the desired state and the running state. And so if there is a difference, we need to find a way to get them to be the same again, right? Now, in the old days, once upon a time, we would get them to be the same again by redeploying from scratch. In the new world, you have these agents that are able to inform different parts of your distributed system.
Alexis Richardson 00:25:44 Hello, Kubernetes, please can you change your settings to be more like this file, not like the file that we seem to think that you’ve got. Or: hello, image, you need to be replaced by an updated version of yourself which hasn’t got a CVE in it. And so convergence is when we gradually replace bits of the system that are wrong with bits of the system that are right, and the number of diffs and the size of the diffs, as it were, decrease to zero. Once we’re back to the state of having no diffs between our desired state and our observed running state, then we have converged to the correct state. And obviously the opposite of that is divergence, or drift. Now, this all works because of two things.
Alexis Richardson 00:26:29 One, Kubernetes is an eventually consistent convergent system that drives its state to the correct state based on what’s in etcd. And then you have external tools like Flux and all the other family of GitOps tools, which help Kubernetes to drive to the correct state based on what’s in your external config and your external images and everything else outside of Kubernetes that’s being dealt with in this way. So Kubernetes is the kernel of it, and it’s the core thing that makes it possible to do it for a lot of the stack. But you also have tools other than Kubernetes, like Flux and Flagger, which are driving convergence for other parts of the system. We even created a virtual machine technology called Ignite, which makes Firecracker look like a Docker container. And one cool thing about that is that it’s a GitOps technology.
Alexis Richardson 00:27:21 So you can drive the state of your VMs using config in Git. So you could do this in lots of different places. Amazon, Google, and Azure have done it for their cloud services with CRDs, and so much more. So convergence, in summary, is an all-powerful tool: provided that you can have a description of the system and you can actually compare it with the running state, then you can use convergence. As a result, I would say one day lots more things will be done in this way. And also, just in case anybody listening thinks that I’ve lost it completely, I’m aware that convergence is not enough when you can’t describe the whole system. That occurs when you have legacy systems, updates with side effects, or things like ServiceNow. External systems that need to be notified of changes are a very important case.
Alexis Richardson 00:28:15 And the other one, of course, is the classic observability problem, which is that nobody can predict complex failures in a distributed system. With this tool, what we’re doing is trying to provide a reproducible correct state for systems where we can predict where the likely changes are going to be, using the model. And of course, for no system is it true that all changes are predicted. In fact, most of the most exciting operational challenges come from unpredictable changes, and that is not something that’s very well covered by GitOps. Typically you need debuggers and observability tools like tracers to dig into those kinds of errors.
Robert Blumen 00:28:58 I think we’ve had pretty complete coverage of what GitOps is, end to end. If I could compare this to some other ways out there: it’s fairly common that you would see a Jenkins server or some kind of build scripting which would try to do the entire process end to end. What you’ve done with GitOps is put this Git repo in the middle, where you have one process that gets things into the Git repo, and then the second piece is the back end, with the agent that is going to try to apply it to production. If I’m thinking, well, we have a Jenkins server, we have a build script, what’s wrong with that? Why should I switch over to GitOps? Why is this an advance?
Alexis Richardson 00:29:48 I mean, it depends what you’re doing with that Jenkins server, typically. Another way of updating Kubernetes is manually: you just type kubectl at the command line, and we use Jenkins to script manual changes and group them together and do them automatically. That scales fine and works well for small, simple changes, but as you do more and more larger-scale changes, more errors can creep in, and you’re more exposed to the problem of a partial failure and a large-scale rollback. This is also difficult when you’re doing fleets of clusters. So let’s say I have 20 clusters with 20 different changes. Using Jenkins for that is a bit risky because, by the time you notice that something has failed on the 20th cluster, things on the first cluster have drifted off completely into a different world. So what you really want is a parallelized process where each cluster looks after itself, and GitOps is great for that.
Alexis Richardson 00:30:44 So I would say there is no hard line, but I would say that as you scale more and want more reproducibility, more self-service, and team automation, GitOps becomes more valuable to you, just as there’s no hard dividing line between using an Excel spreadsheet and a database. I’m talking to you from London, where we recently had a political scandal around the tracking of personal data for people being tested for coronavirus. Somebody working for the government and the health service decided, in their wisdom, that it would be great to store all of this in Excel, and not just in Excel, but also in columns. Now, with modern Excel, you can do millions of rows, but only something like 60 or 65,000 columns. So the spreadsheet ran out of cells because they ran out of columns, and then they were like, oh, now we can’t store any more tests in this sheet.
Alexis Richardson 00:31:40 And so the system suddenly collapsed, and we had a lot of fun newspaper articles about it. Everybody said, why are these people not using a database? And the answer was, well, because somebody at the time was in a hurry and they thought they would do this for 30 people, and an Excel spreadsheet was fine for that. And then some other person carried on using the same sheet, and before you know it, that is the IT. And so Jenkins is a bit like that too. Therefore I suppose it’s reasonable to say, try to make good decisions at the beginning of your IT journey, not in the middle, but of course we know that most people make changes while the plane’s in the air. And that’s just how things are today.
Robert Blumen 00:32:24 There’s a blog post on your site called “How GitOps helped us recover from a total system wipeout.” Are you familiar with that?
Alexis Richardson 00:32:32 Yes. I think I might have written that. So this goes back to why we talk about GitOps in the first place. I’m Weaveworks’ co-founder and CEO, which means that I am a business person by responsibility, if nothing else, thinking about the strategy, vision, customers, growth, and so on, and less about technology, perhaps. We’ve been through a number of stages of going to market and building the SaaS app that I mentioned. Today we sell an enterprise product as well; if you’re listening, Weave Kubernetes Platform solves GitOps and Kubernetes problems. But this is many years ago, when people were talking about Docker, Kubernetes, Mesosphere, and ECS as four different worlds and wondering which one was going to win, and so on.
Alexis Richardson 00:33:23 And we’d made an early bet on Kubernetes to back-end our SaaS, which provided management and monitoring for container apps in the cloud. We thought that would be useful because we thought we would learn about Kubernetes and become good at operating it through running the SaaS, as well as helping customers with what was in the SaaS app itself. And one day, I remember very clearly, it was a spring day. The sun was shining through the windows of our office. The birds were singing outside. This is when we had offices and birds in London. And one person in the office said, oh, I’m about to make a change, which I think might wipe out our entire production systems in all of our clusters and zones. And I was sitting about 15 feet away from this person.
Alexis Richardson 00:34:10 And then we kind of went into Matrix bullet-time slow motion, where I’m reaching out across the table to say no, but by the time the word had made its way out of my mouth, I heard a click, followed by a Middle English curse word, and then, oh, I’ve wiped out our entire production systems. What was interesting, and I stress I am not an engineer, was watching as a number of people who appeared to have been ignoring this, and clearly hadn’t been, stopped doing what they were doing and immediately went into recovery mode. The sort of disaster squad was formed. They already had a plan for how to recover and rebuild the entire system from scratch, rolled it out to multiple zones, and even connected up to our data, and had a live system going in about 45 minutes.
Alexis Richardson 00:35:05 And I thought that was pretty impressive at the time. This is like 2017 or something. So I said to them, how did you do that? I mean, that seemed pretty neat. Can everybody do that? And it turned out that the people who built the core of the recovery tools and systems had been very familiar with these infrastructure-as-code approaches and had worked at Google in particular, where they’d been taken to the next level, especially with these reconciliation loops and convergence tools and how they were tied up with things like the use of Git, the metadata, the insistence on recording everything in version control, and monitoring and security being baked into everything as well. So the way that the system was deployed from scratch was literally following a sequence of instructions that could each be audited as well in Git.
Alexis Richardson 00:36:00 So actually they were used to disasters and recovering from them. And I said, well, what were the things that you needed to do in order to make it possible for your system to work in this way? And they said, well, we describe everything in Git, everything is declarative, and we do these other things as well, which means that we can rebuild the system from scratch or fix failures. And in fact, Alexis, when we do an update, like a continuous deployment, we don’t use CI to do that. We use automated tools that watch what the CI is generating and then react to what the CI gives us and put that into the running system. I thought, wow, that’s pretty neat. I think we should talk about this more. And we spent months talking about what was really essential and boiled it down to these core principles.
Alexis Richardson 00:36:47 And I was talking with the team about it one day, and I can’t remember if it was me or somebody else who said, you could sort of call this GitOps. And I thought, yes, that’s a good name for this thing. We’ve got a tool that was built for developers being used to drive operations. That’s a kind of neat thing. Let’s talk about it and see what happens. So I wrote about this thing called operations by pull request, which really plays to the workflow as a source of operational change, but actually underplayed the really important reconciliation dimension, and then wrote a number of other blog posts based on what I was learning about how the team did operations. And because at Weaveworks we believe in dogfooding everything, we started publishing our open source tools, which we are using to manage the infrastructure, do the continuous deployment, and do the progressive delivery. And that’s where we are today. People are using them everywhere. That’s great.
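[Editor’s sketch, not from the episode: the idea of “automated tools that watch what the CI is generating and then react” can be pictured as a small function that compares the version currently deployed with the versions CI has published and decides whether to roll forward. The dotted numeric tag format and the function names parse, newest, and react are assumptions made for illustration; they are not the API of Flux or any other tool.]

```python
from typing import List, Optional, Tuple


def parse(tag: str) -> Tuple[int, ...]:
    """Turn a tag like '1.4.2' into (1, 4, 2) so tags compare numerically."""
    return tuple(int(part) for part in tag.split("."))


def newest(tags: List[str]) -> Optional[str]:
    """Return the highest version among the tags CI has published, if any."""
    return max(tags, key=parse) if tags else None


def react(deployed: str, ci_tags: List[str]) -> Optional[str]:
    """Return the tag to roll out, or None if CI has produced nothing newer."""
    candidate = newest(ci_tags)
    if candidate is not None and parse(candidate) > parse(deployed):
        return candidate
    return None


# The agent, not the CI job, decides what reaches the running system.
print(react("1.4.0", ["1.3.9", "1.4.1", "1.10.0"]))
print(react("2.0.0", ["1.9.9"]))
```

In this sketch CI only publishes artifacts; the decision to change the running system sits with the watching agent, which is the separation described in the conversation.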
Robert Blumen 00:37:41 In this story, going from not having a system at all to having the system, that’s a diff. It’s not fundamentally different from “I want to change a port for networking” or “release a version of an app.” Is that correct? It’s correct. You’ve talked about some of the open source tools you released, and we’ve talked about this software agent that is doing the application of the diffs. Do you have some tooling in that area that other people can use?
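[Editor’s sketch, not from the episode: the point that rebuilding from nothing is “just a diff” can be illustrated with a function that compares the desired state recorded in Git with the actual state and lists the steps needed to converge. The dictionary-based state model and the resource names are hypothetical simplifications; a real agent works against the Kubernetes API.]

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> spec (hypothetical, simplified model)


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps that move `actual` toward `desired`."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


desired = {"web": {"image": "web:1.4", "port": 8080}, "db": {"image": "db:9"}}

# Total wipeout: nothing is running, so the diff is "create everything".
print(plan(desired, {}))

# Small change: only the port differs, so the diff is a single update.
running = {"web": {"image": "web:1.4", "port": 80}, "db": {"image": "db:9"}}
print(plan(desired, running))
```

The same function handles both cases; only the size of the diff changes.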
Alexis Richardson 00:38:14 We do. Flux CD, or Flux as we call it, is a good place to start. Flux has now evolved into Flux 2, powered by something we call the GitOps Toolkit. Another tool I recommend looking at is Flagger, for progressive delivery, canaries, A/B testing, and rollouts. These tools are described on the Weaveworks site, weave.works. There’s also a great community resource called gitops-community.github.io, which is full of information, help, and education about this, as are many of our Weaveworks blogs too. In terms of other tools, I mentioned a few earlier in the podcast. If you type GitOps into Google, along with management or continuous delivery, you’ll see a bunch of things pop up. One thing I would watch out for is that although GitLab and GitHub and Git are really useful for doing GitOps, just using those is not enough to do GitOps, because they’re not operational tools. So it really comes about through the harmonious conjunction of the operations tools, like Kubernetes and Flux, with the developer workflow tools, like GitLab and GitHub.
Robert Blumen 00:39:21 Where, or on what, does Flux run? Is it another container in your cluster, or do you have to set up some infrastructure where that runs?
Alexis Richardson 00:39:31 No, it just lives in your cluster. It’s a little tiny agent. I think the container is 40 megabytes. The next size up is Argo CD, which I think is 1.3 gigabytes. So Flux is comparatively lightweight.
Robert Blumen 00:39:43 Right. If you have multiple deployments, either different environments like a staging and a production environment, or different deployments in different geographic regions, would you have one instance of Flux running in each discrete deployment?
Alexis Richardson 00:39:59 That’s a common pattern. You can do that. I think there are ways of doing this with a smaller number of Flux instances, so a single cluster and some notion of promotion, but I’m not sufficiently technical to describe those to you right now.
Robert Blumen 00:40:13 All right, fair enough. There is a theme in some of the content on your website about compliance. How does GitOps play in a world in which there are regulatory or audit-trail type requirements?
Alexis Richardson 00:40:32 So, the things that I’ve talked about a lot on this podcast with you, that we’ve talked about together, are deployment management, agility, reproducibility, scalability. These are all things that matter on the operations side of the house, if you’re responsible for running an app and making it usable and useful. But what if you’re responsible for keeping a record of what changes are made to the app? Or what if you’re responsible for keeping a security barrier between production and development? What if you’re responsible for introducing rules that people have to comply with, because you’re in a regulated environment like connected cars or healthcare or banking? All of those things, security, policy, compliance, and audit, are also provided for by GitOps. Isn’t that great? But how does it work? The answer is that when we state that all changes must be approved and audited in the versioned description of the system, then we get an audit trail of changes.
Alexis Richardson 00:41:36 If, in addition, we use Git’s authentication mechanisms and Git’s metadata mechanisms, we can get a secure, audited, authenticated, irrefutable, cross-signed record of changes to our system, supported by context in the form of metadata. People have already been using this to go through things like SOC 2. If you’re running a big website, SOC 2 is one of the things you may have to go through if you have personal data. And now, with things like GDPR and the American versions of that in California, we’re seeing other demands on recording what you did to your website. The audit trail in Git is in many cases sufficient for recording the changes that you have made. Another thing that we do in Git in some implementations is the concept of a write-back, so that if you make a change, the agent can tell the dev system that it thinks that change is completed.
Alexis Richardson 00:42:36 So you get a token saying, you know, Flux thinks it did its job, or Kubernetes thinks it did its job. All of those things can be very useful for audit and compliance. I would say that some people have gotten away with no longer buying $20 million compliance systems through using this instead, which I’m very happy to hear. But again, I don’t want to overpromise, because there are industries where compliance is a runtime matter, not a post-time matter, something you need to apply in flight to things, using policy checks and gateways. Also, in some industries it’s very complex. So let’s not pretend that it’s the solution to everything.
Robert Blumen 00:43:15 If you didn’t have infrastructure as code, then a regulator would say, can I see that you are applying all these regulations? You would have to look at the system, maybe look at config files, or ask someone, did you apply all these checks? This moves the discussion to: let’s look at the code, and we can prove that everything in the code has been applied. It’s a discussion of code, and perhaps of these additional tokens where Flux told you it applied the code. That should answer the question.
Alexis Richardson 00:43:49 And Flux doesn’t take care of this, but Kubernetes RBAC does, and Git RBAC does: we might want to know who made the changes. If we’re in a large system, let’s say we’re running a thousand clusters and 2,000 apps, and their microservices are connected up in funky ways, we might care a lot about whether Robert’s team is allowed to make changes or not, or whether Alexis’s team has to be phoned up to make those changes instead, or if there’s a complex change involving two teams making changes at different times of day. And so compliance can be arbitrarily complicated.
Robert Blumen 00:44:25 If you’re talking about which team can make which changes, that lives in the front-end process, where you’re defining what is an approved change. Is that correct?
Alexis Richardson 00:44:35 Correct. But then there is recording that the change was done in the right way at the right time. So we might want to set a policy that says Robert, and Robert alone, can make this change on Tuesdays between noon and 6:00 PM Pacific time. And then we apply that policy check at runtime, but we also have another check, which is looking at whether Robert was responsible for updating that pull request, how that pull request got merged, at what time of day, and what happened next. These things all constitute evidence of your role as Robert in interacting with the system. And so it’s a combination of policy and audit that gives you compliance.
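[Editor’s sketch, not from the episode: the example policy, “Robert alone, on Tuesdays between noon and 6:00 PM Pacific,” can be written as a check against a merge record taken from the audit trail. The MergeRecord type, the allowed function, and the lowercase author name are hypothetical. A fixed UTC-8 offset stands in for Pacific time here, which is an assumption made to keep the example self-contained; a real policy would use a time zone database to handle daylight saving.]

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset as a stand-in for Pacific time (no DST handling).
PACIFIC = timezone(timedelta(hours=-8))


@dataclass(frozen=True)
class MergeRecord:
    """Who merged a change and when, as recovered from the Git audit trail."""
    author: str
    merged_at: datetime  # must be timezone-aware


def allowed(record: MergeRecord) -> bool:
    """Check the merge against the policy: right person, right day, right hours."""
    local = record.merged_at.astimezone(PACIFIC)
    is_tuesday = local.weekday() == 1      # Monday is 0
    in_window = 12 <= local.hour < 18      # noon up to, but not including, 6 PM
    return record.author == "robert" and is_tuesday and in_window


# 21:00 UTC on Tuesday 3 November 2020 is 13:00 at UTC-8.
merge = MergeRecord("robert", datetime(2020, 11, 3, 21, 0, tzinfo=timezone.utc))
print(allowed(merge))
```

The policy decides what is permitted, and the merge record is the evidence that it was followed; the combination is what the conversation calls compliance.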
Robert Blumen 00:45:15 Got it. Now, Alexis, we’ve covered all my prepared material. Are there any key points about GitOps that you’d like to get across to our listeners that we haven’t covered?
Alexis Richardson 00:45:24 There are just two. I mean, we’ve talked a lot about how it’s different or not from other ways of changing systems. It’s about scalability. It’s about security. It’s about operability, automation, all these good things. The other thing that we haven’t talked about, but I think is incredibly important, is that it makes a lot of things easier. We started hiring developers and had them operating our SaaS and making changes to it with no training in Kubernetes. Imagine that. That’s supposed to be a bit of a goal for people using Kubernetes. I want it to just be boring infrastructure. They don’t need to understand it. And then you’ve got another problem, which is, I mentioned earlier the concept of a machine learning platform as an example of something you might run on Kubernetes.
Alexis Richardson 00:46:07 If I go to a large bank and I say, what do you want to use Kubernetes for? They don’t have one use case. They have 20 use cases: machine learning, other data analysis, back ends of trading systems, batch processing, and it goes on. So imagine being one of the staff in one of those organizations. You walk in in the morning, you sit down in front of your laptop, and you’re in your study now. And you say, I want to have a cluster to do machine learning on. What you don’t want is just a Kubernetes cluster. What you actually want is a ready-to-use data science cluster. So there’s a concept of a data science platform, which would be a preloaded set of add-ons for that cluster. How do we manage those add-ons consistently, to make it the same platform every time?
Alexis Richardson 00:46:53 How do we patch them, maintain them, and therefore support them? And how do we do it at scale? GitOps answers this. It makes it really super easy. And so I think one of the things that’s exciting about GitOps, though we’re still a bit in the weeds of getting things working, is that it just makes a ton of stuff easier for the regular developer. For me, that’s the exciting piece. We are empowering developers to do things that used to be the realm of operations, using tools like Kubernetes. Along the way, a lot of the problems that Kubernetes puts in your way, because it’s clever, are simplified by using GitOps. That’s my first point, which I think is just so important and critical. And the second thing is, if you’re somebody who’s been doing this stuff since 2007, you know, DevOps, and you’ve heard about config as code and infrastructure as code, it’s like, oh my gosh, this is so boring.
Alexis Richardson 00:47:46 I’ve done all these things myself a million times. The answer is actually, we’re finding that more and more different things can be done easily in this way, and I would encourage people to go back with fresh eyes and say, hey, how is this affecting the class of applications that I can manage as if it was infrastructure as code? Is it making more things easier for my teams? Can I manage a fleet of a thousand clusters? How do I deal with multicloud? How do I deal with edge devices? Chick-fil-A, the restaurant chain in America, the chicken people, have, I think, five or 8,000 restaurants, each one with a Xeon quad-core Linux device plugged in underneath the counter where you pay, and that is running a Kubernetes cluster in each one of those restaurants. The updates to that cluster and the updates to the apps on it are done by GitOps: centralized remote management of an 8,000-cluster distributed farm in restaurants. I mean, this is mind-blowing stuff. So I would very much encourage people to see it as a technology just as much of the future as it is a recapitulation of some of the more familiar ideas that you may know from the past, if you have been a DevOps person during that time.
Robert Blumen 00:49:04 Alexis, we’ve mentioned the Weaveworks website, and there are some free ebooks and blog posts there. Where can people find you? Or is there anywhere else on the internet you’d like to direct people? You also mentioned an open source community website, and we’ll put that in the show notes.
Alexis Richardson 00:49:22 We have an upcoming event called GitOps Days, which is in mid-November. It is not the same day as KubeCon. If you look at the GitOps Days Twitter feed, that’s where we’re posting the updates about it. The last one of those was in May and had about 1,200 people showing up, which is great. Online events are fun, a good place to come and hear fresh from practitioners and others. I can be found on Twitter as monadic, M-O-N-A-D-I-C, or also on the CNCF and Weave community Slack, slack.weave.works. I’m happy to talk to anybody about more or less anything if there’s time in the day. And you can also email me; I’m alexis at weave.works.
Robert Blumen 00:50:09 Alexis, thank you very much for speaking to Software Engineering Radio.
Alexis Richardson 00:50:12 Thank you very much, Robert.
Robert Blumen 00:50:15 For Software Engineering Radio, this has been Robert Blumen.
SE Radio 00:50:19 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit [email protected]. To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack [email protected]. You can also email [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.
[End of Audio]