Kief Morris, cloud specialist at ThoughtWorks and author of the recent book Infrastructure as Code, talks to Sven Johann about why this concept is becoming increasingly important due to cloud computing. They discuss best practices for writing infrastructure code, including why you should treat your servers as cattle, not pets, as well as how to monitor your cattle. After examining the benefits — security, auditability, testing, documentation, and traceability — the show wraps up with a look at how to introduce infrastructure as code to organizations.
- Kief’s website
- The Infrastructure as Code book
- Google Site Reliability Engineering book
- DevOps Cookbook
- DevOps Weekly Newsletter
- Testing infrastructure configuration with NMap
- Testing infrastructure configuration with ServerSpec
Transcript brought to you by innoQ
This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Sven Johann: [00:01:17.00] This is Sven Johann, for Software Engineering Radio. Today I have with me Kief Morris. Kief is a cloud specialist at ThoughtWorks and the author of the great O’Reilly book Infrastructure as Code. Today I will be talking with Kief about exactly that. Kief, welcome to the show!
Kief Morris: [00:01:34.13] Thanks, Sven.
Sven Johann: [00:01:35.12] Did I forget to mention anything important about you?
Kief Morris: [00:01:40.00] Those are the key things at the moment.
Sven Johann: [00:01:43.23] What is Infrastructure as Code?
Kief Morris: [00:01:47.28] There are a lot of ways to answer that. One way is that it is the automation in the CALM of DevOps. This acronym, coined in the DevOps community, includes culture, automation, learning and measurement. Infrastructure as Code is the automation piece. It’s how people who have been doing DevOps for a while tend to approach using tools like Chef, Puppet, Ansible, SaltStack and so on. The philosophy behind it is that infrastructure has now become like data, like software, as opposed to being physical things; the physical areas have been abstracted, so we can use these tools the same way that we use software. We can start bringing in tools and practices from software development (continuous integration, test-driven development, continuous delivery, version control systems etc.) and apply them to managing our infrastructure.
Sven Johann: [00:02:51.27] Is there a difference between Infrastructure as Code, programmable infrastructure and software-defined infrastructure?
Kief Morris: [00:03:02.07] Programmable infrastructure, to me at least, and software-defined infrastructure, software-defined networking – those are enablers. Those are the tools. We have a cloud platform like AWS or OpenStack, which means we are able to program these things, as they have an API. Infrastructure as Code is a philosophy of how to go about doing that. It’s the idea of putting your infrastructure in definition files, in version control, running automated tests against them and progressing them through environments. They’re all very tightly related.
Sven Johann: [00:03:45.01] Where is it actually coming from? I heard ‘programmable infrastructure’ the first time two years ago. What changed, that this is now becoming an important thing?
Kief Morris: [00:04:03.22] The concepts have been around for quite a while. If we had to name somebody who pioneered this, it would probably be Mark Burgess, who created the CFEngine tool ages ago (the early ’90s). This tool had files where you defined what you want your infrastructure to be like, and the tool applies that and keeps it continuously so. That’s where I think the ideas come from. Then it was around 2006 or so, around the time when DevOps became a thing, with a label applied to it. It had ways of working that many people had been doing already.
I started working on this book, and I decided Infrastructure as Code was the title I wanted to use, but I wasn’t sure who originated the term. I tried to chase it down, I tried talking to various people who had been involved, and nobody seemed to step up and say, “Yes, I came up with this.” Everybody was pointing me at other people until it went around in loops. The first use of the term that I found was when John Willis – better known as @botchagalupe on Twitter – wrote an article after having gone to a talk, which I think was by Andrew Clay Shafer and Luke Kanies from Puppet. Those two may have been the originators of the term, as far as we can determine, but obviously, a lot of people have done things to popularize it. Probably that’s the time when it came to the attention of a lot of people as a thing.
Sven Johann: [00:05:55.24] When I heard about it, it was more iron age and cloud age. Has the cloud age something to do with it? Or actually, what is the iron age?
Kief Morris: [00:06:08.17] The iron age to me is where our infrastructure was physical. If you wanted to provision a server, that often started with getting a purchase order out to a vendor, having boxes shipped out to you and then bringing them to the data center and racking all that stuff. That made for very long loops between deciding to do something and having it done, and then realizing, “Oh, I forgot something. I needed to add RAM”, “I ordered too much stuff”, or whatever it may be, versus the cloud age, which started with virtualization, even before cloud necessarily.
[00:06:51.21] The idea was that you decoupled the idea of a server from the actual physical thing that it runs on, so everything becomes more fluid and dynamic, and you create servers within minutes, if not seconds. The feedback loop becomes very tight, very quick.
Sven Johann: [00:07:11.27] Was there already something like ‘infrastructure as code’ in the iron age?
Kief Morris: [00:07:18.12] Yes, another inspiration I should mention – there is a website called infrastructures.org, which hasn’t been updated in many years now. This is the first place I came into contact with it. It turned me on to CFEngine, as well as some other tools. It doesn’t mention virtualization or cloud at all, because it predates most of those things. Virtualization was probably around, but it wasn’t as widely used. This site talked about a way of reproducibly building new physical machines, and making sure that they stay consistently configured over time.
[00:08:03.20] This drove me to do things like using tools with PXE boot, so that you could take a physical server, you boot it and it grabs an image over the network, it installs the OS and then runs your configuration tool (CFEngine, at the time) and sets it all up the way that you want it to be, without you having to sit in front of it and select options on the installer menu.
Sven Johann: [00:08:28.19] Why is it good/important to do something like that? I know a lot of people who are afraid of code, they like configuration. Why should I not do that?
Kief Morris: The obvious thing is saving yourself a bit of time and work, and not doing those routine, repetitive things. That’s what machines are for, right? If you’re doing the same thing over and over again, you’d like to be doing something more interesting. Then there’s also the reliability of it. You want to make sure that every time you install a new web server or application server, that it’s done the same way.
Before doing this kind of thing, what we would do is have a checklist, a person would sit down and be like, “Okay, when this comes up, this is the thing I’m supposed to pick.” But then there would always be things that would be a little bit different. Somebody would be like, “Here’s this little trick, this extra thing I can do to make it a little bit better.” Maybe a new update to the system comes out and there’s a new option, so the checklist is never quite up to date, and for whatever reason things are always a little bit different. It’s nice to know that you don’t have to rely on a particular person doing it the same way every time. You just put it into a file.
Sven Johann: [00:09:43.06] That leads us to the practices, right? If I have it in a file, it’s documented, so I don’t have to write the document and say, “Do this, do that…” It’s more how a server is set up, it can be executed automatically.
Kief Morris: Yes, and it’s transparent that way, because anybody can look at that file, and depending on the language and the tool that you use, it should be easy to understand. There’s no ambiguity as to what happens.
Sven Johann: [00:10:18.11] In the beginning you mentioned if we do infrastructure as code, we can apply all sorts of “normal” software development practices. One is we have code, which defines how things should be. What else?
Kief Morris: [00:10:41.15] There’s the testing aspect to it. People are a bit concerned about allowing things to be done automatically and worry that these things can be done wrong automatically – which is very true… These automated tools allow you to do a configuration update or patch a whole bunch of machines all at once; they can also allow you to damage a whole bunch of machines all at once, misconfigure many things very quickly. That’s where the idea of testing comes in, particularly automated testing in a test environment. It’s funny, because this isn’t something that we as systems administrators necessarily do as rigorously as our developers do. We don’t give our developers root access to our production machine to go and make code changes directly to it, but that’s how we tend to do it ourselves on our server configuration.
[00:11:37.01] We might have a test environment where we make a configuration change manually and test to make sure it’s okay, but then we log on to the production server and we make the change manually, and we might make a mistake. What we can do when we’ve got the configuration of our server in files (that are in a version-control system) is that whenever somebody makes a change to one of those files and commits it to the version-control system, we can have a continuous integration server or continuous delivery server (Jenkins, GoCD) automatically trigger and apply those changes to a test environment, and then maybe run some automated test scripts. That way if we made a mistake, it should be caught pretty quickly, and we can also make sure that the development environment that we’re testing against is exactly the same as production, because we’re using those same scripts, because we know there’s no [unintelligible 00:12:25.16] otherwise.
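A minimal sketch of the pipeline idea described here, in Python. It is not tied to Jenkins, GoCD or any configuration tool: the function name `promote`, the environment names and the commit id are hypothetical, and the "apply" and "test" steps are passed in as plain callables. The only point it illustrates is that the same change goes through each environment in order and stops at the first failed test.

```python
from typing import Callable, List, Sequence


def promote(
    change_id: str,
    environments: Sequence[str],
    apply_change: Callable[[str, str], None],
    run_tests: Callable[[str], bool],
) -> List[str]:
    """Apply a committed change to each environment in order.

    Stops at the first environment whose tests fail, so a bad change
    never reaches the environments that come after it.
    Returns the environments in which the change passed its tests.
    """
    passed: List[str] = []
    for env in environments:
        apply_change(env, change_id)  # the same scripts run in every environment
        if not run_tests(env):
            break  # do not promote any further
        passed.append(env)
    return passed


if __name__ == "__main__":
    applied: List[str] = []
    result = promote(
        "commit-abc123",                    # hypothetical commit id
        ["test", "staging", "production"],  # hypothetical environment names
        apply_change=lambda env, change: applied.append(env),
        run_tests=lambda env: env != "staging",  # pretend the staging tests fail
    )
    print("passed:", result)
    print("applied to:", applied)
```

In a real setup the two callables would wrap whatever tool applies the configuration and whatever test suite checks the result; the ordering and the early stop are the part the CI/CD server provides.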
Sven Johann: [00:12:27.18] In my last project we thought we were going to do it like this. We only do a change in code and everything is immutable, but then we quickly figured out it’s better to allow manual changes. Because sometimes you have to do something very quick in production to check out if it works or not. If it works, then you have to apply it in version-control, because every software release, every WAR file we deployed also rolled out a completely new server.
Kief Morris: [00:13:06.05] Yes, it is tempting, and in a lot of cases in practice you do that thing of, “I’m gonna log in on a machine, I’m gonna figure out what it is, I’m gonna make the tweak and fix it to bring it back online.” But you do have that risk that that change is going to get lost, and future servers are not going to have it. Especially, as you say, when you do another release, you will have forgotten it. Because you make a fix in the heat of the moment, and then you forget about it. Then later on it breaks again, and you’re like “What happened? Didn’t this happen last week? What was it?”
Sven Johann: [00:13:43.24] Yes, that happened to us. It happens once, but then it never happens again.
Kief Morris: [00:13:52.10] I think there’s another thing related to that. Because it is so easy to make changes manually – and often times that’s our habit and how we’re used to doing it – it can be painful doing it… Once you set up a CI system with your automation tools, it can take so long to wait. You make a change, you push it, you wait for it to go through, and then you find out, “Oh, it didn’t work, so I have to do it again.” That gets really frustrating. I’ve seen teams lose faith in the whole thing and decide, “Screw it, we’re not gonna do this. We’re just gonna go back to manually, because that’s easier.”
[00:14:24.09] I think the important thing is to focus on improving… When you see these things that are not working for you, or are too difficult, find ways to improve it and optimize it. That might mean using local Vagrant boxes so that you can make your changes and test them and make sure that it’s right before you [unintelligible 00:14:42.16] up, and making the pipeline faster in different ways. You have to keep that faith and discipline to make the system work better.
Sven Johann: [00:14:54.23] Does it also improve security to do that? If every configuration is in version-control and people can review it, you roll it out on different versions, it seems like something which improves security.
Kief Morris: [00:15:19.02] It can do, and it should do. There are some pitfalls there. It improves security because everybody can see it. You can also put in automated tests to test the basic things, make sure there are no ports open that shouldn’t be open, make sure there are no accounts that shouldn’t exist.
Sven Johann: [00:15:35.14] What does that look like? I can imagine how I can configure a machine, but how do I do a test against it? Is that possible with the normal tools?
Kief Morris: [00:15:46.27] Yes, I’ve done that with Serverspec for instance, and that can assert some things. You can also use tools like Nmap, and there are security tools that you can run automatically. There’s one called ZAP. These are command-line-driven tools that you can put into your pipeline. It’s automated penetration testing.
There’s a pitfall, though. When you’ve got all your configuration in the version-control system and all your servers are built from a CI or CD tool, that becomes a very juicy point of attack for attackers. If I can get into your version-control system, I own you. Or your CD tool.
[00:16:42.02] I was at a talk a few weeks ago at QCon with a penetration tester who was giving some examples of how she went into different networks. What scared me was that her first line of attack was looking for a CI server or Jenkins server. But it wasn’t because she wants to monkey around with your code and insert backdoors, it was just that your passwords are in there. Even inside your private network. It’s so easy to get inside someone’s private network, then you find their Jenkins server, and here’s the password for the production database.
You really need to take that very seriously and really put a lot of focus on making sure that you are using these things smartly and not opening yourself up.
Sven Johann: [00:17:25.14] I can imagine it’s not only Jenkins. There are a lot of other tools, but I forgot their name.
Kief Morris: [00:17:32.20] In terms of continuous integration?
Sven Johann: [00:17:35.10] Yes, some web-based tools where it can trigger my…
Kief Morris: Is it like Travis CI? There’s loads of them.
Sven Johann: [00:17:44.07] No, I forgot the name. It works with AWS pretty well. Maybe I’ll remember it later.
Kief Morris: [00:17:51.01] Beanstalk?
Sven Johann: [00:17:52.11] No. I forgot it.
Kief Morris: [00:17:54.19] They all have that, though.
Sven Johann: [00:17:59.20] You also mentioned version-control. Basically, everything I develop in my infrastructure is version-controlled.
Kief Morris: [00:18:06.21] Yes, other than your passwords. Don’t put passwords in there.
Sven Johann: [00:18:10.28] Yes, so if software eats the world, we’d better use version control.
Kief Morris: Yes, absolutely.
Sven Johann: [00:18:17.15] Any other important practices?
Kief Morris: [00:18:20.07] I think the promotion one is a good thing; moving things from one environment to the next. [unintelligible 00:18:24.25] I’ve seen people have a Terraform file, a separate file for each environment, and they manually add it. “Here’s the file for the [unintelligible 00:18:33.29].”
Sven Johann: [00:18:37.12] What is Terraform?
Kief Morris: [00:18:38.26] Terraform is a tool from HashiCorp. These are the people who make Vagrant and a number of other tools that are quite popular. If you’re familiar with something like Puppet or Chef, where you can configure a single server, Terraform does the same kind of thing for a whole environment.
Terraform lets you have a file which defines an environment. It works across different providers, like AWS or VMware. You say, “Here’s what my application server cluster looks like. It uses this server image, it has these networking rules between them…” So it does that exact same thing – you put them into this file, you run the tool against it and it creates or updates an environment to meet those specifications.
[00:19:22.20] There’s a separate file for each environment, but what you should do is parameterize it, so you have a single file which says, “Here’s the environment for my application.” Maybe you have some parameters that you set for, “Okay, my load balancer pool, my cluster size is different for different environments. My production environment is bigger than maybe some of my test environments.” Then you push that file through each environment. You make sure that it’s run and it’s been tested in the test environment. Then it gets pushed through to maybe a staging environment before it gets pushed through to production. The same file is used in each environment, so you know they’re consistent.
Sven Johann: [00:20:02.21] So the code is the same, but the parameters are different. If I want to set up my load balancers or Elasticsearch, everything I need to set up is the same for every environment. If I make a change, I make it only once, but my test environment just has different IP addresses – these things are externalized in a property file, which my Terraform then uses, right?
Kief Morris: [00:20:39.25] Exactly.
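The "same definition, different parameters" idea can be sketched in a few lines of Python. This is not Terraform syntax; it only models the principle. The dictionary keys, image name, cluster sizes and IP addresses are made-up values for illustration.

```python
import copy
from typing import Any, Dict

# One shared definition, used unchanged for every environment (hypothetical values).
BASE_DEFINITION: Dict[str, Any] = {
    "server_image": "app-server-image-1",
    "load_balancer": True,
    "allowed_inbound_ports": [443],
}

# Only the things that legitimately differ are externalized as parameters.
ENVIRONMENT_PARAMETERS: Dict[str, Dict[str, Any]] = {
    "test": {"cluster_size": 1, "load_balancer_ip": "10.0.1.10"},
    "staging": {"cluster_size": 2, "load_balancer_ip": "10.0.2.10"},
    "production": {"cluster_size": 6, "load_balancer_ip": "10.0.3.10"},
}


def render_environment(name: str) -> Dict[str, Any]:
    """Combine the shared definition with one environment's parameters."""
    if name not in ENVIRONMENT_PARAMETERS:
        raise KeyError(f"no parameters defined for environment {name!r}")
    definition = copy.deepcopy(BASE_DEFINITION)  # never mutate the shared definition
    definition.update(ENVIRONMENT_PARAMETERS[name])
    definition["environment"] = name
    return definition


if __name__ == "__main__":
    for env in ("test", "staging", "production"):
        print(render_environment(env))
```

Because every environment is rendered from `BASE_DEFINITION`, a change to the shared part is made once and then promoted through test, staging and production with only the parameters varying.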
Sven Johann: [00:20:42.28] Anything else?
Kief Morris: [00:20:44.09] As you get into this — it’s just like any application code, in that over time it becomes a bit unwieldy, it becomes bloated, difficult to change and so on. As with software, thinking about how to make it more modular really helps. I think that tends to lead then into thinking about what you do with a monolithic application. These days you like to split it into microservices, which are independent and loosely coupled and focused. You can do the same thing with your infrastructure.
I’m using Terraform as an example. There’s also CloudFormation for AWS and Heat for OpenStack; there are other tools out there. I think SaltStack has a tool, and so on. A tendency people have is to have one big file that’s got everything. It’s got my Elasticsearch cluster and my web server cluster, my database server and my Logstash – everything just in one big, massive file.
[00:21:58.06] What then happens is you end up being afraid to touch that file, because so many things might break. The goal there is to break that down and have separate files. If I’ve got an ELK stack, I have one file which deploys that, and I can progress that through the environments, independently of the other bits. You start thinking about what dependencies there are between those and managing those in a fairly loosely coupled way.
Sven Johann: [00:22:27.17] So if I develop an application, I also don’t put everything in one class or everything in one method, and the same applies here: for every logical part or component I want to create, I have a separate configuration file.
Kief Morris: [00:22:47.19] That’s right.
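One way to picture separate definition files with explicit, loosely coupled dependencies is as a small dependency graph. The Python sketch below uses the standard-library `graphlib` module (Python 3.9 or later). The stack names and their dependencies are hypothetical; the sketch shows how to derive a safe apply order and how to see which stacks a change could affect.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Each key is a separately managed definition; each value lists what it depends on.
STACK_DEPENDENCIES: Dict[str, Set[str]] = {
    "network": set(),
    "elk": {"network"},
    "database": {"network"},
    "web": {"network", "database"},
}


def apply_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every stack comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def affected_by(stack: str, dependencies: Dict[str, Set[str]]) -> Set[str]:
    """Return the stacks that depend, directly or indirectly, on `stack`."""
    affected: Set[str] = set()
    frontier = [stack]
    while frontier:
        current = frontier.pop()
        for name, deps in dependencies.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


if __name__ == "__main__":
    print("apply order:", apply_order(STACK_DEPENDENCIES))
    print("affected by a change to elk:", affected_by("elk", STACK_DEPENDENCIES))
    print("affected by a change to database:", affected_by("database", STACK_DEPENDENCIES))
```

With one monolithic file, every change potentially touches everything. With the split above, a change to the hypothetical `elk` stack affects nothing else, which is why it can be progressed through environments on its own.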
Sven Johann: [00:22:48.18] Okay, we briefly talked about why this is important, but I want to dive into it a little bit more. Important benefits of this – we already said it’s more secure and robust, servers are consistent. Anything else?
Kief Morris: [00:23:18.06] If you can do things this way and keep the discipline to keep everything under these automation tools, that consistency gives you a lot of benefits. One of them is it’s easy to make updates and changes.
Take one of these big, high-profile security holes, like Heartbleed: the CIO reads about it in the media and then comes down and says, “I’ve just read about this thing. What are we doing about this?” If you’re doing things in an old-school way and you have a big bunch of servers that are all organically grown and organically managed, the response is usually, “We’re working on a plan for it. We’re gonna pull people off of other projects, and over the course of a couple of weeks we’ll have everything patched up.” Whereas if you’ve got everything under infrastructure as code and all automated nicely, the answer when that CIO comes down should be, “Yes, it’s being done. We’ve got the patch, we’ve rolled it out to the test servers and verified that it’s alright; we fixed a couple of incompatibilities with applications, and it’s under control. It’s got no impact on other projects.” That’s the idea.
Sven Johann: [00:24:37.26] I recently had a discussion about should we do that or not, and one guy said, “We have 500 machines. It would be totally silly not to do that, because we cannot just update 500 machines by hand. We would have to repeat things 500 times.”
I once chased very bad bugs because the machines were configured differently. That’s for me the number one reason to do that, that I don’t have different machines. All machines are consistent. If you look at one machine, every other machine looks the same.
Kief Morris: [00:25:27.20] Yes, and sometimes you can’t avoid a bit of variability between machines, but you can at least capture it. For example, if you work with Java you might have some application which you can’t upgrade to a later version of the JVM, because it’s a third-party application and they haven’t updated it yet or what have you, or there are some things that you would have to change in the libraries and so on. So sometimes you can’t do that, but at least you can capture that, so you can say “Here’s a class of machines. These are my JDK 6 machines, here’s my JDK 7 machines and my JDK 8 machines.” At least that’s exposed in the code, it’s managed, as opposed to just being ad-hoc, where somebody goes in and says, “Don’t upgrade that machine, it will break it!”
Sven Johann: [00:26:12.12] Yes, exactly. Another benefit is traceability and auditability.
Kief Morris: [00:26:20.06] Absolutely. This tends to be important for larger organizations, ones handling money and so on, where you’ve got regulations, or for whatever reason you need a higher level of control; in healthcare, for instance, sometimes you need to make sure that you’ve got a rigorous control over stuff.
It’s funny, because we’ll go into clients with Agile and DevOps, and you get people who are a bit terrified, and are like “We can’t do this because we have to be compliant, and those things mean doing away with discipline and control.” That couldn’t be further from the point. When you sit down with auditors and people responsible for change management stuff, and you start explaining, “This is how we wanna do it”, and it’s very rigorous – no change is going to go into production directly. They all have to go through the full pipeline: it goes into version control, and you have the way something is applied.
[00:27:23.03] When you sit down with an auditor, the typical thing is, “Here’s our runbook, here’s our list of steps that somebody’s supposed to use.” Of course, what’s written down complies with the regulations, but if you look at what people actually use, it may or may not be there. People might take shortcuts, or things run ahead of the documentation, or things change, so people change how they do it without the documentation necessarily being updated. But if you can show the auditor, “Look, here’s the script. This is the script. These are exactly the things that are done every time, and here’s the logs. Go into the CD server and it shows when this change was applied – yes, that script actually was applied – and here’s the outputs of each of the things. And by the way, for every change that was made to our production environment, we can trace it back to version control and say ‘This is the person who made the change’ along with their comment as to why. ‘Here’s who approved it, or tested it.’” That makes this stuff really powerful, and auditors and change folks really love it once they get their head around it.
Sven Johann: [00:28:23.15] Exactly. It sounds almost stupid not to do it.
Kief Morris: [00:28:25.25] It’s hard, right?
Sven Johann: [00:28:27.09] Yes, I can imagine.
Kief Morris: [00:28:28.12] It’s easy to talk about how wonderful it all is.
Sven Johann: [00:28:29.23] Yes, but doing it, it probably takes a month. You mentioned that we have the cloud age – is there any special benefit for on-demand infrastructure? Can I even do on-demand infrastructure without doing infrastructure as code?
Kief Morris: [00:28:57.08] Sure, you can. You can let people willy-nilly create whatever they want. One of the useful things about on-demand is about enabling and empowering teams, and putting the tools into the hands of the people who need it most. If you have a team that’s managing an application – it could be a third-party application, it doesn’t have to be in-house developed; or an in-house development team – if they have the ability to work out, “So what does my application server need to look like? How much RAM does it need to have? How much disk space? How does it need to be tuned and optimized?” In the iron age approach, that team has to specify, “This is what we want”, and hand it off to another team to implement it.
[00:29:38.29] If it turns out there’s something wrong or different about it – “Oh, actually we needed more RAM/we need less” or whatever – it’s a big hassle. You can also get in the recrimination stuff, of like “We implemented what you specified. Why did you specify the wrong thing?” “Well, we didn’t know.” If you say instead to that team, “We’re gonna specify something, we’re gonna spin it up and try it out, and as we learn, we’ll tweak it and change it.” That team can take the ownership for that. They don’t have to go back and forth with somebody else. That means they really understand what their application needs and how to tune it. Maybe you have to change the code to be more efficient, as they understand the infrastructure resources that it uses. That’s a really powerful thing.
Sven Johann: [00:30:22.08] So in my infrastructure code I would say, “Okay, I need so and so many machines, and every machine needs so and so much RAM”, and then I just try it out, and if I think I need something different, I need less machines, or more machines, or more RAM, or less RAM, I just change it in my code, and automatically I have it.
Kief Morris: [00:30:42.19] Yes, exactly. That’s the on-demand part. The concern there is: is it being done responsibly? If you have a bunch of different development teams and application teams in the organization, not every team can have somebody who understands various aspects of systems and networking, so having the pipeline and the automated testing gives some kind of control over that, and says that, “Fine, the application team can go and tweak how many servers they have”, but that will get pushed up to a development environment, and tests will get run.
A central security team can say — we were talking earlier about what tests they might have, what you might use to check whether ports are open… So the security team can implement that and have further development teams take care of it. That helps.
That way, a security team or networking team or other team of experts can work as more of a supporting function: “Here’s some tools that you can use. Here’s some tests to put into your pipeline, and then we can provide consulting, we can come out and help you with stuff where you need extra knowledge.”
Sven Johann: [00:31:55.02] I heard quite often the term “infrastructure as code” in combination with dynamic infrastructure. You cannot have dynamic infrastructure without infrastructure as code. What is dynamic infrastructure?
Kief Morris: [00:32:12.06] Dynamic infrastructure – I like to use that term as an alternative to saying “cloud.” That’s the way that most people think of it these days; it’s like AWS, infrastructure as a service. It’s this idea that you create servers and destroy them on demand – by scripts or what have you – but you can also do it with virtualization, with something like VMware, which doesn’t fully meet the definition of cloud and maybe is not shared. You can even do it with hardware.
The organization that’s the poster child for infrastructure as code and DevOps is Etsy. Etsy don’t use cloud, and they don’t even use virtualization; they use physical machines, but they have the tools in place to automatically provision machines on demand.
The thing about the dynamic infrastructure that can be challenging is that things are going away and reappearing all of the time. This then changes your ways of managing infrastructure. The old way was, “I’m gonna log into my load balancer and edit the configuration for a load balancing pool by IP address.” But then the IP address might change, because a server might be destroyed, and a new one might be added. The number of servers in the pool might grow and shrink hour by hour, minute by minute, based on load.
[00:33:40.03] That’s a challenge for a lot of the tools that we have out there, (monitoring tools and so on) which freak out every time a server disappears. That’s one of the big shifts that we’re still going through.
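A small Python sketch of what tooling has to do instead of relying on hand-edited IP addresses: compare the pool’s current members with whatever servers currently exist, and work out what to add and remove. The function name and the addresses are hypothetical, and discovering the live servers (from a cloud API, a service registry and so on) is assumed to happen elsewhere.

```python
from typing import Iterable, Set, Tuple


def reconcile_pool(
    current_members: Iterable[str], discovered_servers: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Work out how to bring a load balancer pool in line with reality.

    Returns (to_add, to_remove): servers that exist but are not yet in the
    pool, and pool members whose server no longer exists.
    """
    current = set(current_members)
    discovered = set(discovered_servers)
    return discovered - current, current - discovered


if __name__ == "__main__":
    pool = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # hypothetical pool members
    live = ["10.0.0.12", "10.0.0.13", "10.0.0.27"]  # hypothetical live servers
    to_add, to_remove = reconcile_pool(pool, live)
    print("add:", sorted(to_add))
    print("remove:", sorted(to_remove))
```

Run on a schedule or on every scaling event, a loop like this treats a disappearing server as a normal state change to reconcile rather than as an alarm, which is the shift in tooling being described.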
Sven Johann: [00:33:53.27] When I hear “machines appear and disappear all the time”, I also hear quite often the principle “treat your servers as cattle, not pets.” What does it mean?
Kief Morris: [00:34:07.05] In the old days, in the iron age, it would take us a while to set up a server; it was an unusual event, so each server was special. I remember we used to have names for our servers, and it was fun to have a theme. We’d name them after stars, or comic book characters.
You would set up your server and it would be there forever, or for a very long time, and you would lovingly care for it, tweak it, and so on. We can’t do that nowadays, especially when you’ve got hundreds of servers, so you have to be more heartless about it and treat them like cattle. You have to accept that any server can go away anytime.
[00:34:53.14] I worked on one project, and one of my friends on that project was a developer and he needed a place to host some artifacts (Java [unintelligible 00:35:01.00]). He found one of the dev servers and he just installed it on there. The next day it got blown away, because that’s what we were doing – we were doing the dynamic cattle, rather than pets. He was surprised, he wasn’t used to working this way, and that illustrated to me the difference; you can’t assume a server is going to be around tomorrow.
Sven Johann: [00:35:26.06] In my own words, dynamic infrastructure means I have my machines (servers) which carry along my application; I just don’t have a name for it, like ‘server 1’, ‘server 2’, ‘server 3’. I just assume they are there, and they can disappear. When they disappear, I just spin up a new machine automatically. I call it ‘cattle, not pets’, because my pet has a name, and cattle are more or less nameless. I have a lot of cows, and if a cow dies – okay, there will be a new cow. Wait, that’s wrong…
Kief Morris: [00:36:20.22] I think part of it as well is this idea that you can replace it quite easily. It’s the idea that if your server disappears – whether it’s a deliberate thing or not; maybe we just destroy a bunch of servers overnight to save money, or maybe something goes wrong with it; for whatever reason, it disappears – we know we can get a new [unintelligible 00:36:41.28] of it back without any real effort and without any real knowledge. It’s not like, “Oh, that server’s gone. Sven is the one who knows how to set it up again.” It’s like, “Oh no, that server’s gone… Here’s the button you click to bring it back up again, and Sven has put the knowledge of how to create that server and install the things we want on it into the files and scripts, so it just happens.”
Sven Johann: [00:37:07.17] But if I assume that, I have to do something different with logging, for example. I am the guy who knows, “Okay, something’s wrong. I’ll just check a few machines; I’ll log into these machines and check the log files or check something else.” Now I cannot do that anymore, because there is no machine to log in. Or actually, there is a machine, but maybe the machine which carried the error is gone already.
Kief Morris: [00:37:32.18] Yes, this is the case with all of our data, that we have to worry about what happens to it. Let’s say it’s a database cluster. If we’re writing our applications to be stateless, then it’s all very nice. But if we’re talking about a database cluster where we have logs, metrics, history and everything, you need to have a strategy for handling that. With logs – okay, you push it off to a log server; but what about the log server itself?
[00:38:01.13] A popular approach to this is the serverless approach, which is a way of saying “It’s somebody else’s server”, so I’ll use a hosted monitoring service, like Datadog, or something like that. All my stuff goes off to there, and it’s their problem to worry about how they manage their servers. But for my servers, now I know that it comes back; or I use RDS on AWS for my database, and then I don’t need to worry about those things. But sometimes you do; you might be implementing a service like that, or for whatever reason you might have your own data, so you need to think about “Where does that data live?” In the storage strategy, maybe you use EBS Volumes, maybe you use persistent EBS Volumes – my server goes away, but the EBS Volume is still left, so when I create a new server, I re-attach it.
Sven Johann: [00:38:51.23] EBS is…?
Kief Morris: [00:38:55.14] EBS is Elastic Block Store, and it’s like a virtual hard drive that you attach to a server when you bring it up. It can be done in a persistent way, so that when the server disappears, that hard drive volume, that EBS Volume, still exists, so you can re-attach it to another machine when you build it again.
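The re-attach step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `replace_server` is a hypothetical helper name, the `ec2` argument is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`), and the image, instance type, zone and device values are placeholders that would come from your own setup.

```python
def replace_server(ec2, volume_id, image_id, instance_type, zone, device="/dev/sdf"):
    """Launch a replacement server and re-attach an existing EBS volume to it.

    `ec2` is assumed to be a boto3 EC2 client. The volume is assumed to
    already exist in `zone`; a volume can only be attached to an instance
    in the same availability zone.
    """
    reservation = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_id = reservation["Instances"][0]["InstanceId"]

    # Wait until the new instance is running before attaching the volume.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # The data on the volume outlived the old server; hand it to the new one.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return instance_id
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, so the rebuild logic can be tested without touching a real account.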
Sven Johann: [00:39:13.23] So everything which is stateless in my applications should be able to go away at any time; that’s how I design them. But everything which relates to data, I have to put some thought into it. For files I probably use EBS… And database – to me it sounds like the single database is dead. I always need a cluster, because my cluster machines can also disappear.
Kief Morris: [00:39:42.10] Yes, you would want to have a cluster, probably geographically distributed. Amazon is really easy to talk about, because people tend to be familiar with it. They have availability zones, which means if you put two different nodes of a cluster into two different availability zones, you know they’re in different physical locations. So if one of them goes away, the other one persists. And backups as well. This is standard data continuity and disaster-recovery-type stuff.
[00:40:12.13] You need to make sure that you’re backing this stuff up somewhere that’s more persistent. If your entire cluster goes away, you still need a way to build a new cluster and restore a backup to it.
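As a rough illustration of spreading a cluster across availability zones, the sketch below assigns nodes to zones round-robin and then checks which nodes would survive the loss of one zone. The node names, zone names and both function names are hypothetical; a real platform would do the placement through its own API.

```python
from itertools import cycle


def place_nodes(node_names, zones):
    """Assign each cluster node to a zone, round-robin across the zones."""
    if not zones:
        raise ValueError("at least one zone is required")
    return dict(zip(node_names, cycle(zones)))


def survivors(placement, failed_zone):
    """Return the nodes that are still available if `failed_zone` goes away."""
    return sorted(node for node, zone in placement.items() if zone != failed_zone)


# Hypothetical three-node database cluster spread over two zones.
placement = place_nodes(["db-1", "db-2", "db-3"], ["zone-a", "zone-b"])
remaining = survivors(placement, "zone-a")
```

If `remaining` were ever empty for some zone, the cluster would depend on a single physical location, which is the case the backup-and-restore path has to cover.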
Sven Johann: [00:40:26.27] It’s also what Netflix is doing. They deliberately destroy the availability of something, just to check if that works.
Kief Morris: [00:40:37.14] Yes, the Simian Army – the Chaos Monkey and the Chaos Gorilla, which would just randomly destroy things and prove to themselves that their systems can cope with it.
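The idea can be sketched as a toy simulation. This is not how Netflix's tools are implemented; `chaos_round`, the pool of server names and the `launch` callback are all assumptions made for illustration. In a real setup the platform (not the chaos tool) would be responsible for bringing the pool back to size.

```python
import random


def chaos_round(servers, desired_count, launch, rng=random):
    """Destroy one randomly chosen server, then restore the pool size.

    `servers` is a set of server names, `launch` is a callable that
    returns the name of a freshly built server.
    """
    victim = rng.choice(sorted(servers))
    servers.remove(victim)

    # Self-healing step: rebuild until the pool is back at its desired size.
    while len(servers) < desired_count:
        servers.add(launch())
    return victim
```

Running a round like this regularly is a way of proving that nothing in the pool is a pet: if losing any one server causes a problem, the round exposes it.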
Sven Johann: [00:40:49.13] I want to make a little move. We talked about principles and benefits, also about dynamic infrastructures and dynamic infrastructure platforms. AWS, for example, is an infrastructure platform, right?
Kief Morris: [00:41:06.00] Yes.
Sven Johann: [00:41:06.26] Okay. What is a dynamic infrastructure platform providing?
Kief Morris: [00:41:15.00] The base-level dynamic infrastructure platform provides compute (servers) – although there are variations to that when you get into containers – storage and networking. Those are the core things which almost everything else is built on top of. If you look at things like monitoring services, there’s some compute, and there’s some storage and networking involved in building that service.
Sven Johann: [00:41:43.02] So compute, network and storage. If I understood correctly, these are also the data parts in the world of infrastructure as code, right?
Kief Morris: [00:41:58.11] Yes, you could consider it as data. Well, storage, certainly. Compute is more about how you run jobs and run activities, processing.
Sven Johann: [00:42:09.22] I actually meant something else. The way I understood it, in the world of infrastructure as code we have code which creates, configures, deletes and orchestrates all kinds of elements, and these elements are data. And I understood that the data which the code creates, configures and so on is actually network parts, storage parts…
Kief Morris: [00:42:43.00] You can definitely think of it that way, yes. It used to be hardware and physical things, now it’s data that can be moved around between hardware.
Sven Johann: [00:42:54.22] In my code I have algorithms, and algorithms take, let’s say storage, and they do something with storage. For example in my code I just say, “Okay, from my dynamic infrastructure platform I get my EBS, and I want to connect my server to EBS.” Is that how it works? Sorry, I’m confused.
Kief Morris: [00:43:25.06] That sounds right, if I understand what you’re saying. It’s interesting that in infrastructure as code it tends to be declarative stuff, like Chef, Puppet and Ansible, Terraform [unintelligible 00:43:40.09]. It’s all declarative, rather than being algorithmic. But the algorithms, that part of it is in the tools that apply those, like the Chef client itself, which reads that declarative cookbook recipe and decides, “Okay, how do I make this happen?”
[00:44:00.23] In a way, what we call code could also be looked at as configuration, and the tools that apply it are the software.
Sven Johann: [00:44:13.01] These tools (Chef, Puppet), I need them for configuring my machines and creating my machines, right?
Kief Morris: [00:44:29.03] Yes, they’re the ones that actually make it happen. If it’s Chef, it’s like a recipe that says, “I want this version of Java installed on my servers”, and then Chef is the tool that consumes that definition/rule and says, “Okay, I’m gonna make it happen. Given a particular server, I’m gonna carry out the actions necessary to install that version of Java onto it.”
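The split between a declarative definition and the tool that applies it can be illustrated with a small Python sketch. It is not Chef's actual implementation or syntax; `DESIRED`, `plan` and `apply_actions` are hypothetical names, and the package versions are made up. The definition only states what should be true; the "tool" compares it with the server's current state and works out the actions.

```python
# Declarative definition: what should be installed, not how to install it.
DESIRED = {"java": "11", "nginx": "1.24"}  # hypothetical package versions


def plan(installed, desired):
    """Compare a server's actual state with the desired state.

    Returns the list of actions needed; an empty list means the server
    already matches the definition.
    """
    actions = []
    for package, version in sorted(desired.items()):
        current = installed.get(package)
        if current is None:
            actions.append(("install", package, version))
        elif current != version:
            actions.append(("change", package, version))
    return actions


def apply_actions(installed, actions):
    """Return the server state after carrying out the planned actions."""
    result = dict(installed)
    for _action, package, version in actions:
        result[package] = version
    return result


server = {"java": "8"}                  # state of one particular server
actions = plan(server, DESIRED)
server = apply_actions(server, actions)
```

Applying the same definition a second time produces no actions, which is what makes it safe to run the tool repeatedly against every server.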
Sven Johann: [00:44:56.10] Okay, so we have a platform which is providing everything we need (AWS), we have tools (Chef, Puppet) which configure my servers… You also mentioned monitoring. How do I monitor? Can I use something like Nagios to monitor my cloud infrastructure?
Kief Morris: [00:45:21.21] You can, yes. You need some way of configuring your monitoring to know what to monitor. That can be dynamic, as we were talking about earlier; servers might come and go. Whatever tooling you’re using to create your servers probably needs some kind of hooks to then communicate to Nagios and say, “Okay, here’s a new server. Add this to your configuration.” When servers are removed, it would remove them from the Nagios configuration.
[00:45:53.09] One of the things with these tools is that most of the monitoring tools out there were designed at a time when this wasn’t a common way of doing things. They tend to be a little bit static; they don’t necessarily handle it very well. What you want is to add and remove servers, but you also want to retain the data. You want to be able to look back to last week and say, “Okay, that server that’s gone now – I still want to be able to see what its history was so I can find out about a problem that happened a while ago, or I want to find out about trends.” You need to make sure that your monitoring tool is able to do that without complaining about servers that are missing.
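The two requirements described here (hooks that add and remove servers, and history that outlives a server) can be sketched as a small in-memory model. This does not reflect the API of Nagios or any other tool; `MonitoringRegistry` and its methods are hypothetical names used only to show the behaviour.

```python
class MonitoringRegistry:
    """Toy model of a monitor that tolerates servers coming and going."""

    def __init__(self):
        self.active = set()   # servers that should currently be checked
        self.history = {}     # server -> list of (timestamp, metric, value)

    def register(self, server):
        """Hook called by the provisioning tool when a server is created."""
        self.active.add(server)
        self.history.setdefault(server, [])

    def deregister(self, server):
        """Hook called when a server is destroyed; its history is kept."""
        self.active.discard(server)

    def record(self, server, timestamp, metric, value):
        """Store one measurement for a server that is currently active."""
        if server not in self.active:
            raise KeyError("unknown or removed server: " + server)
        self.history[server].append((timestamp, metric, value))

    def servers_to_check(self):
        """Only active servers are checked, so removed ones raise no alerts."""
        return sorted(self.active)
```

Separating the "what to check now" set from the stored history is the design choice that lets you look back at a server from last week without the monitor complaining that it is missing today.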
Sven Johann: [00:46:33.23] What are the monitoring tools I should use these days? Which tools can do that out of the box?
Kief Morris: [00:46:41.06] People do it with Nagios, people do it with Sensu… There’s a tool that came out of SoundCloud, and I’m struggling to remember the name of it. They’ve created an open source tool – I think it’s Prometheus. I haven’t used it, I don’t have a direct experience of it, but it’s one which I would probably have a look at for my next project when I’m picking a monitoring tool.
Sven Johann: [00:47:06.11] Does Prometheus allow me to see how everything is orchestrated, or which machines depend on each other?
Kief Morris: [00:47:24.26] They can, yes. It depends on how you configure them. Let’s say you have an application which includes a web server, application server cluster, database cluster and so on. You can configure it at that high-level view and say, “Is my application working from end to end?” and maybe it does a health check on the frontend that triggers a run through all of the pieces, to make sure everything is in place and working correctly. Then you can also have it checking the individual components. “Is each application server okay? Is the memory on each application server okay?” and so on, down to various levels of detail. Then the monitoring tool can be configured to know how those relate and to show them in a sensible way. You can start out with a high-level view of the list of applications and then drill down into the different things that make up that application.
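The high-level view with drill-down can be modelled as a tree of checks. The sketch below is an assumption-laden illustration, not the configuration format of any particular monitoring tool: `Check`, the application name and the probe functions are all hypothetical, and the probes simply return fixed values where a real check would make a request or read a metric.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Check:
    """A node in a health-check tree: an optional probe plus sub-checks."""
    name: str
    probe: Optional[Callable[[], bool]] = None
    children: List["Check"] = field(default_factory=list)

    def healthy(self) -> bool:
        """Healthy only if its own probe and all of its children pass."""
        own = self.probe() if self.probe else True
        return own and all(child.healthy() for child in self.children)

    def report(self, depth: int = 0) -> List[str]:
        """Indented status lines, from the high-level view downwards."""
        status = "OK" if self.healthy() else "FAIL"
        lines = ["  " * depth + self.name + ": " + status]
        for child in self.children:
            lines.extend(child.report(depth + 1))
        return lines


# Hypothetical application: the end-to-end probe passes, but one
# application server is short on memory.
app = Check("shop", probe=lambda: True, children=[
    Check("web server", probe=lambda: True),
    Check("application servers", children=[
        Check("app-1 memory", probe=lambda: False),
        Check("app-2 memory", probe=lambda: True),
    ]),
    Check("database cluster", probe=lambda: True),
])
```

Calling `app.report()` gives the top-level status first and the component statuses indented beneath it, which mirrors starting at the list of applications and drilling down.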
Sven Johann: [00:48:24.15] Okay, thanks. I want to make a little shift to a totally different topic within infrastructure as code. It’s more an organization question now… Enterprises to me seem to be afraid of code. A lot of large enterprises like purely configurable systems. Now you are coming and you claim, “We need to program our infrastructure.” We listed all the reasons you should do that, it’s obvious, but what are the typical arguments you hear from enterprises why infrastructure as code is impossible? “It brings no benefits to my organization. It may work for Netflix, but we are a serious business in finance”, I get that quite often.
Kief Morris: [00:49:18.11] Yes, absolutely. There’s different things. When it comes to things like finance and those bigger ones, they’re concerned about security, concerned about the governance stuff, and the way they’re organized has a lot to do with it. They tend to be organized very functionally. You’ve got your database group, your networking group, security group…
[00:49:42.19] I remember one organization I was with last year, I counted probably around nine different groups that had to be involved technically in implementing an environment. Another nine had to be involved in just the governance aspects. We had change management, project management, security and risk – there were all these different teams, and each one of them had to be involved. This can make it very difficult in that the way they are used to working is very much by inspection. Before any change is made, they need to see very comprehensive documentation about what that change will be, they need to inspect it and have their input and go around and around.
[00:50:22.12] Then as changes are made, every change needs to be inspected to make sure it was made accordingly. That encourages very big batch changes, where it takes nine months to set up an environment for a new application. When we talk about doing infrastructure and software in a continuous delivery mode, we’re talking about making changes multiple times a week.
The way to approach this – and the way we approached this organization – is to engage each of these teams in how we build the system, basically the pipeline that changes roll through, so that the design of that pipeline is something everybody has had an input into, and they get comfortable with how it is going to work and how it is going to ensure that the things they’re concerned about are addressed.
[00:51:13.06] So we got to a point where they’re now doing releases on a weekly basis, which for them is a massive thing to be able to do that, release into production that frequently. It’s a lot of work to get everybody to the table. It’s a lot of work, and it does require a mindset change, of people getting comfortable with a different way of approaching this.
Sven Johann: [00:51:36.17] So if I want to do that, I check all the departments which might have a problem with it, or which are scared, or have concerns, and then I need to find an answer to their concerns and build it in. Then I have a chance of doing that.
Kief Morris: [00:51:59.10] Yes, you’ve got to involve them all. It can take a while. Because of the way these organizations tend to be geared up to work – it’s very much in a waterfall way. A lot of these people don’t expect to be engaged until you’re shortly going to go live, like a month before. Throughout this particular project, new people were popping up that we hadn’t heard of, so we were continually having to go back to the drawing board and explain to new people the same things that we had been explaining previously. It can take a while.
The acid test is getting live in production, because in order to get there you pretty much have to hit everybody that has a say in the matter.
Sven Johann: [00:52:44.06] When I do that, moving from classic infrastructure to infrastructure as code and dynamic infrastructure – to me it looks like a large systemic change. I and many others are afraid of large systemic changes, and I would like to move forward in small steps. Did you do that in that large company?
Kief Morris: [00:53:13.22] Yes, that project was one project, not the entire organization. Not everything across the organization has changed according to that. I think the right way to do it is with vertical slices in software – a single thing that has value, that we want to take through to production, as opposed to focusing on “Let’s change the way we do our development environments across the entire organization first, and then we’ll change the way we do our staging environments, or what have you.”
It’s about doing that vertical slicing – getting one small thing end to end. It might involve making exceptions or new ways of working that not everybody else in the organization is allowed to do, but doing it properly, with the proper approvals and all of that. That becomes a proof of concept, and then other groups – or even while we were doing this, we were constantly talking to other groups within the organization which were asking how they could do some of the things that we were doing, and learn from what we were doing, and use us as examples when they’re talking to their own change management or security groups, to say “Oh, they found a solution to this problem!” And that helps. That’s how change then starts to percolate throughout the organization.
Sven Johann: [00:54:28.09] I can imagine even in my project – which has also a lot of machines/servers to configure – I could just say, “Hey, I pick one small server (search, maybe) which is not as important as a database or something else”, and then we do it for this one particular part, right?
Kief Morris: [00:55:01.12] Absolutely.
Sven Johann: [00:55:02.12] Okay, that would work. So I have a hybrid approach and say, “I would like to do that for this project, but I understand all the concerns. Here is a way to address these concerns”, and we just try it out for one small thing. If that one small thing doesn’t work, or doesn’t work perfectly, then okay, it’s only a small thing. If it works, we learn over time and then we roll out the whole idea more and more over the project and eventually over the organization.
Kief Morris: [00:55:35.05] Absolutely. That’s a great way to do it.
Sven Johann: [00:55:37.23] Alright. Is there anything important I forgot to ask you?
Kief Morris: [00:55:44.07] I think we’ve covered quite a lot of stuff. There’s definitely plenty to talk about and think about, but I think that’s a good start.
Sven Johann: [00:55:53.21] It’s all in the book. When we discuss the book… We probably need to do more shows on that topic, because it’s really huge. This one might be just an introduction, but there is a lot of interesting stuff to think about.
Kief Morris: [00:56:12.24] And it’s changing. This was one of the challenges I faced as I was writing the book – stuff was changing. As I was writing, new things were popping up; containers, Docker and all that went from something that I could just mention to something that I had to talk about a little bit more in-depth. Then even towards the end, it’s like there’s unikernels, there’s serverless…
[00:56:33.25] At some point you have to cut it off and get it out the door, but things aren’t standing still. There’s always going to be new and interesting things to talk about.
Sven Johann: [00:56:41.09] What are good sources for our listeners which I could put in the show notes? Obviously, your book, but also websites where people can listen to…
Kief Morris: [00:56:52.10] Yes, there’s a couple other books that are out now. There’s Effective DevOps; I think the DevOps Cookbook is coming out, if I’m not mistaken, very soon. There is the Google SRE book (Google Site Reliability Engineering)…
Sven Johann: [00:57:03.13] The Google SRE book is Site Reliability Engineering…?
Kief Morris: [00:57:06.09] Yes, that’s the one. There’s a number of books coming out right around now which are good. There’s also a book by Thomas Limoncelli and a few co-authors on distributed network management, or something like that, that has a chapter on cloud and things like that. But overall, I think it’s a very good resource for people wanting to learn about systems administration topics in the current environment.
Then as far as websites, I would recommend the DevOps Weekly e-mail newsletter. That kind of points me at a lot of other stuff. I don’t tend to go to any particular sites religiously, but I get a lot of pointers to good articles from there. There is the High Scalability website, which has a lot of good systems administration topics, and again, pointers to other things out in the news.
Sven Johann: [00:58:09.10] And organizational change – is that a topic? Or is it only a topic for me?
Kief Morris: [00:58:13.22] No, it’s definitely a topic. I don’t have off the top of my head good sites to recommend for that, unfortunately. But it’s definitely a topic I read about a lot and follow a lot.
Sven Johann: [00:58:29.07] I mean organizational change in the sense of “How do I introduce that to an organization?”
Kief Morris: [00:58:33.24] Yes, yes. There are a lot of web blogs in the DevOps community which tend to touch on that, because it is so core to it. The automation and the tools on their own – a) they’re not enough to really get you the benefits that you need, and b) you really can’t get enough from them without making changes to the way the organization works.
Sven Johann: [00:58:57.25] Okay, thank you Kief for being on the show.
Kief Morris: [00:58:59.29] Thanks for having me.
Sven Johann: [00:59:01.03] This is Sven Johann, for Software Engineering Radio.
* * *
Thanks for listening to SE Radio, an educational program brought to you by IEEE Software Magazine. For more information about the podcast, including other episodes, visit our website at se-radio.net.
To provide feedback, you can write comments on each episode on the website, or write a review on iTunes. Mention or message us on Twitter @seradio, or search for the Software Engineering Radio Group on LinkedIn, Google+ or Facebook. You can also e-mail us at [email protected]. This and all other episodes of SE Radio are licensed under the Creative Commons 2.5 license. Thanks again for your support!