Venue: KubeCon 2017 in Austin, Texas
Edaena Salinas talks with Nicole Hubbard about migrating a VM infrastructure to Kubernetes. The discussion begins with an overview of the VM infrastructure at WP Engine. Next, they discussed the characteristics that indicated it was time to consider migrating to a container based infrastructure using Kubernetes. Containers are different than VMs. There are pros and cons that were explored throughout the discussion. As for the infrastructure, Nicole explained what Kubernetes is and why they chose to use it. The work involved planning and executing a migration of about 60,000 customers. Other topics discussed were cost benefits and scaling an infrastructure. Edaena and Nicole also talked about the challenges of running an application that has to be available worldwide. Topics include: kubernetes, vms, system architecture, data migration
Show Notes
Related Links
- https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
- https://www.nextplatform.com/2015/08/06/containers-versus-virtual-machines-when-to-use-each-one-and-why/
- https://www.infoworld.com/article/3173266/containers/4-reasons-you-should-use-kubernetes.html
- https://www.se-radio.net/2016/01/se-radio-show-246-john-wilkes-on-borg-and-kubernetes/
Transcript
Transcript brought to you by IEEE Software
Edaena Salinas 00:00:19 For Software Engineering Radio, this is Edaena Salinas. This episode was recorded at KubeCon in Austin, Texas. KubeCon is the largest Kubernetes conference to date. Here, I talked with Nicole Hubbard. Nicole is an architect at WP Engine, where she focuses on building container-based infrastructure automation and helping teams deploy their applications. Prior to that, she worked at Rackspace on projects to power their cloud servers and on building software to automate the management of their data centers. Nicole has over 10 years of experience developing software and has a passion for building microservices that manage infrastructure. You’re giving a talk here at KubeCon titled “Scaling to 5,000 Unique Kubernetes Deployments,” and how you’re doing this is what we’re going to be talking about today. But first, to give a bit of context, I want you to describe what WP Engine is.
Nicole Hubbard 00:01:30 Yeah, so WP Engine is the leading WordPress digital experience platform. We work with over 70,000 customers from 130 countries, and we power over 300,000 WordPress installations on our platform. 300,000? 300,000. That’s a lot. And over 5% of the web visits at least one WP Engine-powered site every day.
Edaena Salinas 00:01:59 Okay. So this is really good because we get an idea of the scale of the product. Let’s talk about managed WordPress hosting. Can you describe what this is?
Nicole Hubbard 00:02:13 So our co-founder, Jason Cohen, started WP Engine because back in the late 2000s, he was starting his own blog and he was having a lot of difficulty running that. And so he actually started WP Engine as a simple hosting company. And then we’ve expanded on that to add additional offerings, things around security and making the platform highly performant. And then lastly, we added our support layer. Over the years we’ve been building on that platform, adding things like analytics and personalization for our customers that help them power their digital experience.
Edaena Salinas 00:02:54 WordPress is open source (wordpress.org). So I’m really curious: why are people willing to pay to have free software managed for them?
Nicole Hubbard 00:03:06 So WordPress is free like a puppy is free. You know, you get the puppy for free, but you still have to feed the puppy and take it to the vet and get shots. And WordPress has maintenance needs just like anything else. So just because you can download a zip file with PHP code doesn’t mean that it scales, and doesn’t mean that it’s secure by default. And there’s a lot of maintenance around keeping it up to date, following security releases, and those types of things.
Edaena Salinas 00:03:37 So I want to start getting into a bit of the technical details of WP Engine. Can you describe its architecture, the main components at a higher level first?
Nicole Hubbard 00:03:48 Yeah. So we run on top of a virtual infrastructure. You mean virtual machines? Virtual machines, yes. And so in that virtual machine, we run everything we need to power the sites on that machine. So we have NGINX and Varnish and MySQL and all the storage for those machines there. We also have another offering, that’s our clusters, and it’s very similar. The difference is we’re able to run numerous web heads, and those run our NGINX and Apache to do the PHP processing, and our Varnish workers. And then we actually pulled the database and file storage off into separate servers that are then shared.
Edaena Salinas 00:04:37 I see. And just before this, we were talking about how people are willing to pay to have their software managed. What are some of the problems WP Engine is solving? Again, just to give the context before we get into the migration, I want to understand what it is solving. Security?
Nicole Hubbard 00:04:58 So companies turn to WP Engine to help with security and with ensuring that their site is online. We have an award-winning support team that’s available 24/7 to answer any questions that come up. And so those are the huge advantages that we offer.
Edaena Salinas 00:05:17 What are some examples of applications and customers of WP Engine?
Nicole Hubbard 00:05:23 Yeah. So one of our customers is AMD; their Developer Central website is hosted with us. They were facing some problems. Their CDN solution did not allow them to host files greater than 25 megabits, and they were constantly uploading files larger than 200 megabits. So they had to turn off their CDN, and this caused huge performance problems for them. So they worked to migrate to WP Engine to help solve those. Another one of our customers is the CMA, the Country Music Association, and they actually just recently had their large event a couple of weeks back. We worked with them very closely to ensure that they were able to keep their site online during their award ceremony and keep it updated as the award winners were announced.
Edaena Salinas 00:06:15 During your time as architect, one of the projects you’ve been focusing on is migrating to Kubernetes. And from what I understand, and from what you just mentioned, it was built on top of virtual machines originally. Can you describe what this VM infrastructure consists of and what its purpose is?
Nicole Hubbard 00:06:40 Yeah, so the VM infrastructure consists of NGINX, Varnish, Apache, MySQL, and some other tools we run internally to help manage stuff. The first step in our migration into Kubernetes is to actually move the PHP processing off of the VMs and into the Kubernetes cluster.
Edaena Salinas 00:07:04 But before we get to that, let’s just focus a little bit on these virtual machines. So prior to the migration, what used to be the current state of things? These virtual machines, what were they being used for? You mentioned they host Apache and things like that, but...
Nicole Hubbard 00:07:24 Okay. Yeah. So the virtual machines host all of our customer sites. Most of the sites live on one of the virtual machines, unless they’re in one of our clustered offerings.
Edaena Salinas 00:07:36 What were the indicators that showed it’s time to move to something else, or that maybe we don’t want to stick around with this virtual machine infrastructure?
Nicole Hubbard 00:07:46 So one of the largest indicators that we saw with our VMs were around the utilization. We had very low utilization numbers as a whole, when we took the aggregate for all of them, um, below 40%. So most of our virtual machines were sitting around very bored, but they would have occasional spikes where they would use all of the resources for each of the VMs, but those never seemed to occur at the same time. And so we thought there has to be a way we can get better utilization.
Edaena Salinas 00:08:23 Is this an issue because you’re saying 40% is being utilized and this costs money. Is that what the issue is? They’re just sitting there. Yeah.
Nicole Hubbard 00:08:33 So we leverage AWS and Google for most of our infrastructure. And so we pay per core and memory that we provision. And for those to be sitting around not being utilized, it means we’re spending money on infrastructure we don’t need. Okay.
Edaena Salinas 00:08:49 And was this the initial infrastructure, the virtual machines, or was there something prior to these virtual machines?
Nicole Hubbard 00:08:56 So we’ve always been on virtual machines. The way they’ve looked and the way they’ve been shaped has changed over the years, but they’ve always been very similar to what we have now.
Edaena Salinas 00:09:05 And this project that you’ve been working on is, well, first of all, at the beginning you said there’s 70,000 customers, I think. And you’re working on moving 60,000 customers from virtual machines to Kubernetes. Did you decide to move particular customers first? Like, is there a step where you’re saying, oh, it’s better if we move X first and Y later?
Nicole Hubbard 00:09:35 So we have different levels of plans that we offer to our customers. So we have our personal plans. We have our premium plans, business plans, enterprise plans. So with the migration, we’ve started with our personal plans. Um, these are generally people’s personal blogs. So we will roll that out to a small percentage of them first and validate no problems. And then as we continue in the rollout, we gradually start including the additional plans going up the stack to some of our larger customers,
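A minimal sketch of the staged rollout described here: sites are grouped by plan, a small slice of the lowest-risk plan goes first, and higher plans follow. The function name `rollout_waves`, the `canary_fraction` value, and the site names are hypothetical; only the plan names come from the conversation.

```python
# Plan tiers named in the interview, ordered from lowest to highest risk.
PLAN_ORDER = ["personal", "premium", "business", "enterprise"]


def rollout_waves(sites, canary_fraction=0.05):
    """Group (site, plan) pairs into migration waves, lowest-risk plan first.

    canary_fraction is an assumed value, not a figure from the interview.
    """
    waves = []
    for plan in PLAN_ORDER:
        tier = sorted(name for name, site_plan in sites if site_plan == plan)
        if not tier:
            continue
        if plan == PLAN_ORDER[0]:
            # Start with a small slice of the lowest-risk tier, then the rest.
            canary_size = max(1, int(len(tier) * canary_fraction))
            waves.append(tier[:canary_size])
            if tier[canary_size:]:
                waves.append(tier[canary_size:])
        else:
            waves.append(tier)
    return waves
```

Each wave would be validated before the next one starts, so problems surface on the sites that can best tolerate them.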
Edaena Salinas 00:10:11 Because they have specific characteristics, their usage of it, or,
Nicole Hubbard 00:10:17 A lot of times it’s due to the risk profile that those sites are willing to tolerate. So we wanted to keep the risk to our larger sites to a minimum, as much as possible, and ensure that we were able to catch any problems before we had major sites on it.
Edaena Salinas 00:10:38 Okay. So here you are talking about moving just one low risk tier of your real customers. Do you also do first, like some test accounts and then just pretend you’re migrating them. They’re not real customers, but they look a lot like real customers.
Nicole Hubbard 00:10:55 So we have a bunch of internal accounts that we use. Some of these provide information for our support teams. We have public-facing sites that we host, one of those being our support garage. And so we are actually working to move those sites, and we’ll complete those before we move customers. So we like to validate things with internal accounts before we ever put customers on them, to ensure that we’ve caught the vast majority of any problems that are going to come up before they impact the customer.
Edaena Salinas 00:11:28 Okay. And the reason we’re talking about this is because this problem is not unique to WP Engine. There are lots of companies that go through these types of migrations. We’re talking about moving test accounts first, then internal accounts, then real customers from a specific tier, and you’re moving 60,000 of them out of 70,000 first. Are there customers with unique systems that will have to remain in the virtual machine infrastructure? Is there such a thing as, oh, because of the type of page that you’re hosting, you should always be on a virtual machine?
Nicole Hubbard 00:12:14 So for our initial rollout, we are going to continue to allow our enterprise customers to run on their VMs, mostly because they have larger workloads. They already have web servers that scale up and down based on the requests that they have coming in. So we’re going to continue with that for those customers, but all the others who run on single-node VMs, they’ll be migrated into Kubernetes. Okay.
Edaena Salinas 00:12:41 How big is a WordPress instance?
Nicole Hubbard 00:12:46 So it varies based on the installation of WordPress. They can range from a couple of megabytes all the way to gigabytes, depending on how many assets you have, how many plugins you’ve installed, how large your database is based on the posts you’ve made and the comments on your site, and all of that. So we have things ranging all the way from the megabytes, 10 or 20 megs, up to numerous gigabytes.
Edaena Salinas 00:13:17 And essentially you’re hosting a WordPress instance, right? Yes. Okay. About how many would you say you can serve from a single virtual machine?
Nicole Hubbard 00:13:30 Due to some of the limits that we’ve run into around our security modeling, we’re limited to only a couple hundred per machine on the virtual infrastructure that we use today.
Edaena Salinas 00:13:43 Let’s talk a bit more about this virtual machine infrastructure. What are the technical advantages that virtual machines provide plain server applications?
Nicole Hubbard 00:13:55 So we get to leverage the cloud, which is a huge advantage for us versus having to buy hardware. It allows us to scale up our infrastructure as customers purchase things, without, you know, having server hardware costs.
Edaena Salinas 00:14:13 Which was the initial motivation of why a lot of companies moved to those, right. Because they, they had their own hardware or maybe they were not even creating a business because they couldn’t afford the hardware and things like that. Right. Yeah. Okay. Let’s start talking a bit about this migration from virtual machines to Kubernetes. Do you know more detail about these cost benefits in addition to the utilization?
Nicole Hubbard 00:14:37 Yeah. So as we move into Kubernetes, our target is around a 70% CPU utilization. And so we’re hoping that this is able to actually provide us a significant reduction in our infrastructure costs. Just for moving into Kubernetes.
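As a rough back-of-the-envelope illustration of why higher utilization lowers cost: if the same workload that kept machines about 40% busy can be packed onto machines running at 70%, less capacity has to be provisioned. The sketch below assumes cost scales linearly with provisioned capacity and ignores cluster overhead; `capacity_saving` is a hypothetical helper, not a WP Engine calculation.

```python
def capacity_saving(current_util, target_util):
    """Fraction of provisioned capacity saved when the same work runs at target_util.

    Assumes cost is proportional to provisioned capacity (a simplification).
    """
    if not 0 < current_util <= target_util <= 1:
        raise ValueError("expected 0 < current_util <= target_util <= 1")
    return 1 - current_util / target_util


saving = capacity_saving(0.40, 0.70)
print(f"{saving:.0%}")  # prints 43%
```

Under these assumptions the saving comes out at roughly 43%, which happens to sit inside the 40 to 50% range quoted later in the conversation.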
Edaena Salinas 00:14:55 Let’s do a quick recap of what Kubernetes is. This is what we are going to be talking about later on. Can you explain what Kubernetes is? Yeah.
Nicole Hubbard 00:15:04 So Kubernetes is a container orchestration tool. And so what that means is we’re able to tell Kubernetes what the state of our cluster should be, and it will continuously run and ensure that things are in the correct state using reconciliation loops. What do you mean by state? So the state is which things should be running. Like we want three instances of this application running in the cluster, and we want 10 instances of this application. And so we schedule those in and it then coordinates that and spreads it across all the hardware.
Edaena Salinas 00:15:47 So is the state about the cluster or the application? Because when I first heard “state,” I understood it like running state, stopped state, restarting.
Nicole Hubbard 00:15:59 In this case, it’s mostly the state of the application and the desired state of it.
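The desired-state idea can be sketched as a tiny reconciliation step: compare what should be running with what is running, and compute the changes that close the gap. This is a simplified illustration, not Kubernetes code; `reconcile`, `apply_actions`, and the application names are hypothetical.

```python
def reconcile(desired, running):
    """Return the replica changes needed to move `running` toward `desired`.

    Both arguments map application name -> replica count. A positive value
    means "start this many instances", a negative value means "stop this many".
    """
    actions = {}
    for app in sorted(set(desired) | set(running)):
        diff = desired.get(app, 0) - running.get(app, 0)
        if diff:
            actions[app] = diff
    return actions


def apply_actions(running, actions):
    """Apply the computed changes and return the new running state."""
    new_state = dict(running)
    for app, diff in actions.items():
        new_state[app] = new_state.get(app, 0) + diff
        if new_state[app] == 0:
            del new_state[app]
    return new_state


desired = {"web": 3, "worker": 10}
running = {"web": 1, "old-job": 2}
running = apply_actions(running, reconcile(desired, running))
```

A real controller repeats this comparison continuously, so once the states match, the next pass finds nothing left to do.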
Edaena Salinas 00:16:06 So by desired state, it would be, oh, now I need four instances of the application and there used to be one. Okay. One of the things you mentioned about Kubernetes is that it’s an orchestration tool for containers. These are not virtual machines. Can you explain the difference between these two?
Nicole Hubbard 00:16:30 So when you start a virtual machine, you get a full operating system with all of the libraries and all of the software and packages that come in an operating system. So this gives you your init system, your logging via syslog in Unix. And then you’ve got SSH so that you’re able to log in. And then after you have all your basic system processes, you then get to start running your processes on top of that. Whereas when you run inside of a container, you’ve still got your operating system that’s running on the host node, but when you start the container, the only things that are actually starting are the application that you’ve defined in it. And it also provides an isolated environment for your application to run in. So say you have two applications that require different versions of Ruby, with some of the same packages but different versions of them, and things like that. That can start to be a little bit of a pain to manage all on the same server. In a virtual machine? In a virtual machine, yeah. But when you stick those into a container, they get their own isolated runtime environment, and that isolation allows them to have only exactly what they need.
Edaena Salinas 00:17:53 Okay. So you said with a virtual machine, you’re getting a full copy of the operating system every time, so essentially you’re getting a new computer, to put it in simple terms. Yeah. And for containers, there’s one OS that you specify, called the host, and inside of it you run multiple containers. And in each container you specify the specific environment that you need for your application to run. For example, in the same machine, you can have a container running Python 2.7 and another one that’s Python 3, right? Correct. Okay. Let’s start talking about this migration from virtual machines to Kubernetes. What are some of the other disadvantages of having customers in virtual machines, in addition to this cost-related problem? Is there anything else, like about the OS and installing things?
Nicole Hubbard 00:18:53 So one of the other downsides that we face is as we continue to grow the number of virtual machines, we have continues to grow. Uh, for example, if you look at my talk, it says 5,000 unique instances. Cause when I submitted it, we had just over 5,000 virtual machines. Um, now we’ve actually surpassed 6,000 virtual machines. And that was in a short amount of time.
Edaena Salinas 00:19:17 Yeah. Was it a couple of months or it was over yeah.
Nicole Hubbard 00:19:22 Less than a year. Okay. And as we continue to grow, when we have to deploy changes out to all of those servers, it takes longer and longer. And then there’s the management of those servers. So we have to manage, you know, OS kernel upgrades, packages on the servers that have to be upgraded, and all of those things, and that adds a lot of overhead.
Edaena Salinas 00:19:48 Is this growth attributed to new customers or just existing customers getting more views of their apps or both.
Nicole Hubbard 00:19:57 So it’s both, um, we have existing customers who continue to grow with us and upgrade to even larger plans that resulted in us building larger infrastructure for them, and then adding new customers as well.
Edaena Salinas 00:20:12 Okay. What would you say the main motivation of this migration was?
Nicole Hubbard 00:20:17 Um, one of our main motivations was to simplify the management of our infrastructure. Okay,
Edaena Salinas 00:20:22 Great. First I would have guessed the cost related, but that’s a good answer.
Nicole Hubbard 00:20:30 The business is all about the cost savings; the engineering department’s all about the simplification. Yeah.
Edaena Salinas 00:20:36 So there’s both benefits. If I talked to somebody in that area, they would say the cost, but simplifying the infrastructure is good too. You have the customers on WP Engine that, you know, you’re going to move to Kubernetes. How do you structure this migration plan?
Nicole Hubbard 00:20:54 Yeah. So we’ve written some software that goes through and looks at the plugins that are on all of the installations. What do you mean by plugins? So WordPress supports installing plugins. These range from things that add simple functionality, say a different commenting system instead of the built-in one, all the way to things like WooCommerce, which allows you to provide e-commerce through WordPress. One of the biggest risks for the migration is the plugins. So we look through the plugins that are installed on the sites, and we build out the migration plan around plugins that we know don’t have any problems, and we’ll work to transition those sites first. And then where we’re unsure, we do some testing to make sure those plugins work, and then we slowly start moving those installations over, so that we’re entirely confident in those plugins as well.
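The plugin-driven planning can be sketched as a simple classification: sites whose plugins are all on a known-good list go first, and sites with any untested plugin are held back for testing. This is a sketch under assumptions; `plan_migration`, the site names, and every plugin name other than WooCommerce are hypothetical, and the interview does not describe the actual tool.

```python
def plan_migration(site_plugins, known_safe):
    """Split sites into "migrate first" and "needs plugin testing".

    site_plugins maps site name -> iterable of installed plugin names.
    known_safe is the set of plugins already verified on the new backend.
    """
    migrate_first = []
    needs_testing = []
    for site, plugins in sorted(site_plugins.items()):
        untested = sorted(set(plugins) - set(known_safe))
        if untested:
            # Record which plugins block this site so they can be tested.
            needs_testing.append((site, untested))
        else:
            migrate_first.append(site)
    return migrate_first, needs_testing


# Hypothetical inventory for illustration only.
inventory = {
    "blog.example": ["comments-plus"],
    "shop.example": ["woocommerce", "custom-rewrites"],
    "plain.example": [],
}
first, blocked = plan_migration(inventory, known_safe={"comments-plus", "woocommerce"})
```

As plugins pass testing they move onto the known-good list, and a rerun of the plan releases the sites they were blocking.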
Edaena Salinas 00:21:56 Okay. And the example that you mentioned of a plugin is, let’s say I have my own site, it’s a blog. It comes with a default way for people to write comments about my posts. So a plugin would be a different way to write comments on my site. Yeah. Okay. Why did you mention these plugins are a risk?
Nicole Hubbard 00:22:22 One of the largest parts of this migration is we’re moving from Apache with mod_php to PHP-FPM. And in that migration, we lose the ability to use .htaccess files. And there are some plugins that heavily rely on that, as we found as we started this migration.
Edaena Salinas 00:22:46 Can you explain what the .htaccess file is?
Nicole Hubbard 00:22:50 Yeah. So an .htaccess file allows you to set configuration values for Apache, for the directory that you’re in and its subdirectories. One of the largest uses of it is to create rewrite rules. These allow you to change the way that the URL gets sent to the backend to actually be processed. And then you can also do things like redirects and some other, you know, configuration values.
Edaena Salinas 00:23:22 Okay. So what you were saying is plugins provide this risk in this case because they’re using a different version of PHP that relies on this?
Nicole Hubbard 00:23:34 It’s because of the use of .htaccess.
Edaena Salinas 00:23:38 Yes. Okay. So that’s the first step of the migration plan: to analyze the website or blog and find these dependencies. Are the plugins considered dependencies?
Nicole Hubbard 00:23:53 Yeah. You could consider them dependencies,
Edaena Salinas 00:23:55 So you’ll find them and then flag them if there’s a plugin that, you know, cannot be migrated. That’s the first step? That’s the first step. Okay. What comes next in the migration plan?
Nicole Hubbard 00:24:10 So during the actual transition, when we bring the site online in Kubernetes, what we do is a quick validation of the site before we transfer it, to ensure the site is working as we expect. And then we transition the site to performing the PHP processing inside of Kubernetes. And then we do another validation to ensure that it’s still responding the same way.
Edaena Salinas 00:24:38 So this is once you’ve migrated it, you do a first validation or where’s the first validation happening.
Nicole Hubbard 00:24:45 So the first validation happens right before we migrate them. So we check it, then we migrate it, and then we validate that it’s the same.
Edaena Salinas 00:24:59 What are some of those checks, the first validation.
Nicole Hubbard 00:25:03 So for the most part, it’s loading the homepage of the site and ensuring that it loads without any errors.
Edaena Salinas 00:25:11 Okay. So you’re checking that the page loads without any errors before you have done anything to the page. Right. Okay.
Nicole Hubbard 00:25:18 If something about the site’s not working correctly already, we don’t want to potentially make that worse by migrating.
Edaena Salinas 00:25:26 Okay. And then what did you say comes after this first validation?
Nicole Hubbard 00:25:32 So after that we migrate them and then we do that exact same validation step again, to ensure that the site’s loading correctly being processed in Kubernetes,
Edaena Salinas 00:25:45 Does this validation need to be automated, or is it...
Nicole Hubbard 00:25:48 Yeah, so we wrote automated scripts around it, and those provide the ability for us to move an entire virtual machine with a simple command.
Edaena Salinas 00:25:58 Oh, okay. And what technology was used for those validation scripts.
Nicole Hubbard 00:26:04 So they are done in Python.
Edaena Salinas 00:26:07 Okay. Because at first I thought this would be like, there are these tools for testing websites, where Selenium is one of them, and what it does is automatically launch a browser. You see it being launched and rendering the page and things like that. So...
Nicole Hubbard 00:26:26 Yeah. So those are really good tools. In this case we’re also using Go for this. We’ve written an internal tool in Go that can actually go through and hit all the sites. And so the Python actually wraps around that. So the Go tool does the actual validation of the site.
Edaena Salinas 00:26:49 Okay. In this validation, are you looking at network traces or no response error code.
Nicole Hubbard 00:26:57 We’re specifically looking at response codes and the actual body of the page, to ensure it returned correctly.
Edaena Salinas 00:27:05 Okay, great. And you don’t look at the UI, what it looks like? No, not in this case. Okay. And for the body, it’s the HTML body of the page. What are you validating there?
Nicole Hubbard 00:27:22 Just that it’s loading correctly and that there’s actually a body there. Okay.
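The check, migrate, check-again flow can be sketched in Python using only the standard library. This is not WP Engine’s tool (which the interview says is written in Go with a Python wrapper); `looks_healthy`, `fetch`, and `migrate_with_validation` are hypothetical names, and treating “HTTP 200 with a non-empty body” as healthy is an assumption.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def looks_healthy(status, body):
    """Assumed health rule: HTTP 200 and a body that is not empty."""
    return status == 200 and bool(body.strip())


def fetch(url, timeout=10):
    """Return (status, body) for url; network failures become status 0."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as err:
        return err.code, b""
    except OSError:  # URLError, timeouts, refused connections
        return 0, b""


def migrate_with_validation(url, switch_backend, fetch_fn=fetch):
    """Validate the homepage, switch the backend, then validate again."""
    if not looks_healthy(*fetch_fn(url)):
        # Do not migrate a site that is already broken.
        return "skipped: unhealthy before migration"
    switch_backend()
    if not looks_healthy(*fetch_fn(url)):
        return "failed: unhealthy after migration"
    return "migrated"
```

Passing `fetch_fn` as a parameter keeps the flow testable without a network; a caller would supply its own `switch_backend` callable to do the actual reconfiguration.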
Edaena Salinas 00:27:27 Cool. Let’s see. What are the specific pieces that need to move from a VM to Kubernetes?
Nicole Hubbard 00:27:37 So in our first step of migrating into Kubernetes, the specific piece that we’re migrating is the PHP processing. So we’re moving from running Apache on the virtual machine to running PHP-FPM with a Go wrapper around it in Kubernetes.
Edaena Salinas 00:27:58 What did you say is the original PHP version, or what was the name? PHP-FPM, that’s the new one, right?
Nicole Hubbard 00:28:08 That’s the new one. Oh, the original was Apache.
Edaena Salinas 00:28:11 Okay. And is there a significant difference between those two?
Nicole Hubbard 00:28:17 For the most part, no. The largest difference is the .htaccess file that I mentioned earlier, but besides that they both run PHP code almost the same way. So there’s very little difference.
Edaena Salinas 00:28:30 Okay. But why wouldn’t you use the same one?
Nicole Hubbard 00:28:34 So with Apache, you have to write configuration files for every site and point them to the directories. And then there’s a lot of work we have to do. We leverage AppArmor to provide security, so that one site can’t access the files of another site, and so that we’re able to restrict certain things that we need to restrict for security reasons. In the case of PHP-FPM, we’re able to start the workers in isolated namespaces and then mount the files for the installation in real time as requests come in, so that they only have access to those files. That wasn’t something we could easily accomplish with Apache.
Edaena Salinas 00:29:21 Okay. So would you say PHP-FPM has sort of wrapped all this notion of config options?
Nicole Hubbard 00:29:31 To an extent, yes. A lot of it is that we wrote a lot of custom Go code that handles the requests as they come in before passing them to PHP-FPM. And since PHP-FPM uses a FastCGI connection, we’re able to forward the request to it over that.
Edaena Salinas 00:29:52 What do you mean by FastCGI connection?
Nicole Hubbard 00:29:56 FastCGI is the implementation, or the interface, that you use to talk to PHP-FPM.
Edaena Salinas 00:30:05 So just to recap, part of the motivation for not using the same PHP setup is these config files, and there’s a lot of hacky work, it seems, to secure it and make sure you don’t talk to who you’re not supposed to talk to. So it greatly simplifies the engineering process to move to PHP-FPM. Yup. Exactly. Okay. As part of this migration, when you’re moving VMs to Kubernetes, how do you store customer-specific data? For example, is it volume mounts, or how does this work?
Nicole Hubbard 00:30:43 So what we’ve done for the initial stage is we leverage NFS to mount the files from the virtual machine into the Kubernetes pods.
Edaena Salinas 00:30:58 Can you explain a little bit about NFS?
Nicole Hubbard 00:31:00 Yeah. So NFS provides you a way to share a file system across the network. It basically allows you to mount that file system and work with it just like you would any other set of files.
Edaena Salinas 00:31:14 What have been some of the benefits of moving to Kubernetes? Because, like we mentioned, it’s a container orchestration tool, right?
Nicole Hubbard 00:31:25 Yeah. So beyond the simplification of management and the cost savings, which are two huge benefits, we’ve also been able to start leveraging Kubernetes for our internal applications, so some of our internal microservices. And we get a lot of value there in the way that we’re able to deploy things like monitoring services and logging services. We’re able to get automated processes that configure those and read the logs and ship them to our logging destinations, and things along those lines.
Edaena Salinas 00:32:04 So has it increased the speed at which you deploy instances?
Nicole Hubbard 00:32:09 It does allow significantly faster deploys as well. And as we continue moving forward with building smaller microservices that are only responsible for one thing, that allows us to deploy those in a significantly faster, CI/CD way, versus our older, more monolithic projects, which had to be deployed out to 6,000 VMs. We were actually only able to deploy those once a week.
Edaena Salinas 00:32:40 Okay. Do you have an idea of the cost comparison of running Kubernetes versus VMs?
Nicole Hubbard 00:32:49 So we’re looking at somewhere between 40 and 50% cost savings as we continue to move into Kubernetes
Edaena Salinas 00:33:00 Between 40 and 50, five zero? Okay. Wow. That’s pretty good. Have there been unexpected costs along the way in this migration, though? Like, yes, you get cost savings, but maybe there are some costs for doing this.
Nicole Hubbard 00:33:17 Not really. We’ve run into a few things where we just kind of underestimated the amount of resources we would need, but nothing huge.
Edaena Salinas 00:33:27 Does the customer know there’s this migration happening? Maybe not in migration terms, but do they know something’s going on?
Nicole Hubbard 00:33:35 So as part of our communication with our customers, we’re giving them a heads up. We’re not telling them specifically when they’re being moved, but we do tell them that coming up in the near future, your site will be transitioned to a new backend. But hopefully, if everything goes according to plan, it’s completely non-impactful for the customer.
Edaena Salinas 00:34:02 Yes. And what do you need to do to make sure it’s not impactful? For example, is it running at the same time in the VM and in Kubernetes, or how do you guarantee that it will be seamless to the customer?
Nicole Hubbard 00:34:17 So for the first week after we migrate a site into Kubernetes, we continue to run Apache on the VM. And if for any reason we’re unable to make the request to Kubernetes, or if, for any reason, a customer starts reporting problems, we’re able to switch them back to the Apache backend and allow everything to process just like it was before. Okay.
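The fallback described here can be sketched as a small routing function: try the Kubernetes backend, and fall back to Apache on the VM if Kubernetes cannot be reached or if the site has been switched back. The names `handle_request` and `use_kubernetes` are hypothetical, and the two backends are modelled as plain callables for illustration.

```python
def handle_request(request, kubernetes_backend, apache_backend, use_kubernetes=True):
    """Serve from Kubernetes when possible, otherwise from Apache on the VM.

    use_kubernetes=False models a site that was switched back after a
    customer reported problems.
    """
    if use_kubernetes:
        try:
            return kubernetes_backend(request)
        except ConnectionError:
            # Kubernetes could not be reached: fall through to Apache.
            pass
    return apache_backend(request)
```

Keeping Apache running during the overlap period is what makes this switch possible without moving any files.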
Edaena Salinas 00:34:43 So it sounds like at the beginning, you, you’re going to have both infrastructures fully running,
Nicole Hubbard 00:34:49 Right. Just for a brief period of time. Okay.
Edaena Salinas 00:34:54 Is that time determined based on if customers are reporting things or is it set in stone? Like for one month we’re going to run both of them.
Nicole Hubbard 00:35:03 So the goal is two weeks, but if customers are reporting things, we’ll obviously extend that.
Edaena Salinas 00:35:09 Have there been migrations that have failed?
Nicole Hubbard 00:35:14 Not currently. We’ve only moved a handful. Well, we’ve moved our internal stuff and we’ve done some burn-in. And so we have a couple hundred sites that are running on it now, and we’ve not experienced any problems.
Edaena Salinas 00:35:33 And even if some of them fail, well, you have this rollback plan, right? Yeah. Okay. So it’s looking pretty good from what you’re saying. That’s good. And let’s see. Does the duration of the migration depend on who you’re migrating, like how big the site is and things like that?
Nicole Hubbard 00:35:51 Since we’re only moving the PHP processing and we’re not actually having to move the files or anything along those lines, the size of the install doesn’t matter in this case. So what happens is we spin up the infrastructure in Kubernetes to do the PHP processing for that set of installs. And then once that’s running, we reconfigure those sites to then process there. And it’s completely transparent to the requests coming in and everything’s handled.
Edaena Salinas 00:36:25 What are some of the technical aspects of container resource management in Kubernetes? Or do you have to manage the, is there resource management?
Nicole Hubbard 00:36:36 So Kubernetes allows you to control the resources of the individual containers. So you’re able to limit the CPU and the RAM. And so we have limits that we’ve put in place for each of those to help keep those wherever we need them to be.
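As an illustration of per-container limits, the sketch below builds the `resources.limits` portion of a container spec as a plain Python dictionary. The container name, image, and limit values are assumptions chosen for the example; the transcript does not state the actual limits in use.

```python
def container_spec(name: str, image: str, cpu_limit: str, memory_limit: str) -> dict:
    """Build a container entry with CPU and memory limits.

    The layout mirrors the `resources.limits` field of a Kubernetes
    container spec. CPU is given in cores or millicores (e.g. "500m"),
    memory in bytes with a suffix (e.g. "256Mi").
    """
    return {
        "name": name,
        "image": image,
        "resources": {
            "limits": {"cpu": cpu_limit, "memory": memory_limit},
        },
    }


# Hypothetical values, for illustration only.
spec = container_spec("php", "example/php-worker:1.0", "500m", "256Mi")
```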
Edaena Salinas 00:36:53 And is this automatically handled as you’re scaling? Like, does Kubernetes take care of that?
Nicole Hubbard 00:37:02 So Kubernetes won’t automatically set your resources, but as long as you’ve set the limits, you can then use what’s called a horizontal pod autoscaler and that will be able to automatically scale your application up and down.
Edaena Salinas 00:37:18 And do you specify a range or something?
Nicole Hubbard 00:37:21 Yeah, you can give it a min and a max.
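The min and max bound the replica count that the horizontal pod autoscaler computes. A simplified sketch of that calculation follows, based on the formula in the Kubernetes documentation (desired replicas = ceil(current replicas × current metric / target metric)). It deliberately ignores the autoscaler's tolerance band and stabilization behavior, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified horizontal pod autoscaler calculation.

    Scales the replica count in proportion to how far the observed metric
    is from its target, then clamps the result to [min_replicas, max_replicas].
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, with 4 replicas at 90% average CPU utilization and a 60% target, the unclamped result is 6; a max of 10 leaves it unchanged, while a max of 5 would cap it.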
Edaena Salinas 00:37:23 That’s great. Let’s talk a bit more about scaling an infrastructure. In the description of your talk, I saw that you mentioned most organizations only need to run a couple of deployments of their application in Kubernetes. What do you mean by this?
Nicole Hubbard 00:37:43 So what I mean is, when you look at, you know, a service that a company writes, they deploy it into Kubernetes and they usually run a production version, maybe a staging and a QA and a testing and some dev versions. But overall, you’re not having to deploy the same application over a hundred times. You’re talking maybe 10 to 20 instances of it. Whereas in our case, we have to deploy it thousands of times.
Edaena Salinas 00:38:15 I see. So this is because of the scale of WP Engine. So what you’re saying is that, in general, the majority is not at this big a scale.
Nicole Hubbard 00:38:24 Yeah. That, and you’re able to use, you’ll still have a large scale, but it’s scaling one application really large. Whereas we have to provide for, you know, over 300,000 WordPress installs, we have numerous installations.
Edaena Salinas 00:38:44 Okay. And in these cases, why is it straightforward to deploy to Kubernetes? Is it because it’s just clicking a button or just a simple command?
Nicole Hubbard 00:38:55 So there’s a lot of tools out there that help with deploying into Kubernetes. Helm is one of the big ones. And so Helm makes it really easy to deploy an instance of your application. And even when you need to deploy a staging version and a production and/or dev versions, it makes that process really easy for you. But when you start having to deploy even a hundred of those, you’re having to manage a hundred different config files, one for each of those, that you provide to Helm. And that starts to become a little cumbersome in Helm.
Edaena Salinas 00:39:32 So then what do you do in that case? Is there another tool that you use or
Nicole Hubbard 00:39:37 So what we did, we actually wrote a tool that we’ve made open source, uh, called . And what it does is it leverages the Kubernetes concept of custom resources, and these can represent any object that you want. And so we’re able to define our virtual machines as a custom resource and then run a Helm deploy for each of those virtual machines.
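A rough sketch of the idea of turning one custom resource per virtual machine into one Helm release, so that no per-instance config file has to be maintained by hand. Everything specific here is an assumption for illustration: the `VirtualMachine` kind, the `example.com/v1` API group, the `spec` fields, the chart name, and the environment variable names. This is not the open source tool mentioned in the interview.

```python
def helm_release_for(vm_resource: dict) -> dict:
    """Derive a Helm release description from a VM custom resource.

    The custom resource follows the usual Kubernetes object shape
    (apiVersion, kind, metadata, spec). The returned dict holds the release
    name, chart, and the values that would be passed to Helm.
    """
    name = vm_resource["metadata"]["name"]
    spec = vm_resource["spec"]
    return {
        "release": f"php-{name}",
        "chart": "php-processing",  # hypothetical chart name
        "values": {
            # Hypothetical environment variables that differ per instance.
            "env": {"VM_HOST": spec["host"], "FILES_PATH": spec["filesPath"]},
        },
    }


# Two hypothetical VM custom resources.
vms = [
    {
        "apiVersion": "example.com/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": f"vm-{i}"},
        "spec": {"host": f"10.0.0.{i}", "filesPath": "/srv/sites"},
    }
    for i in (1, 2)
]

releases = [helm_release_for(vm) for vm in vms]
```

The point of the design is that the list of virtual machines becomes the single source of truth; adding a VM resource yields a new release without writing another values file.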
Edaena Salinas 00:40:08 Okay. And are you managing Kubernetes in your own infrastructure or through a cloud?
Nicole Hubbard 00:40:14 So we’re leveraging cloud providers, specifically Google and AWS. And we’re leveraging Google’s GKE offering.
Edaena Salinas 00:40:23 Okay. That’s interesting, because some of the people that I talk to, they’re either on AWS or Google or only Azure. Is there a reason why you’re using two of them? Are you doing things in one of them that you don’t get in the other one?
Nicole Hubbard 00:40:41 Not really. No, it’s not that one of them provides us something the other doesn’t. It’s, um, mostly about the availability of regions and what some of our customers demand and want to have for their provider.
Edaena Salinas 00:40:56 Okay. So one provider might be in regions the other one isn’t, but technically they’re very similar; it’s just about these regions. Correct. Okay. What are the reasons why you would need to simultaneously deploy 5,000 unique instances of your application?
Nicole Hubbard 00:41:21 So for our case, it’s specifically around the scale we’re at and being able to deploy and continue to migrate into Kubernetes. And so to make that process as simple as we could, it was easier not to completely overhaul all of our infrastructure in the first stage, but to continue to leverage the infrastructure we had in place and start pulling pieces out of it.
Edaena Salinas 00:41:50 And what do you mean by unique instances? For example, is there a case where you’re deploying simultaneously 5,000 not-unique instances of your application?
Nicole Hubbard 00:42:03 Essentially, um, that would be something that could happen. In our case, what we’re deploying is 5,000 instances of the same application. So it’s our custom PHP processing application. And the only differences are some environment variables that tell the process where the files are located, so which of the VMs it’s responsible for serving.
Edaena Salinas 00:42:31 So that’s what the “unique” stands for in this case. Yep. Okay. What are some of the challenges of having an application that needs to be available worldwide?
Nicole Hubbard 00:42:43 So there’s a lot of unique use cases that you start to run into when you look at worldwide availability of applications. One of them being, not all countries have the same connectivity, and when the request is coming from China and it has to go all the way to the U.S. to be served, that starts to add a lot of latency. And so trying to remove as much of that latency as possible, it’s actually one of our goals in future steps in our migration to Kubernetes, to try to remove as many of those network hops as we can, by making the front edge of our network available worldwide, in all of our data centers, for all of our customers.
Edaena Salinas 00:43:35 So what you’re saying is, to reduce that latency is when you have to use a CDN, or not?
Nicole Hubbard 00:43:42 CDNs can help with this. So as part of our offering, we do provide a CDN that customers can use, but not all of our customers leverage the CDN.
Edaena Salinas 00:43:54 Do you know why? Is this because of cost, or
Nicole Hubbard 00:43:57 Mostly, I think it’s around cost.
Edaena Salinas 00:44:00 Okay. Are there any other challenges of having a worldwide application? Well, those are the big ones. Those are the big ones. Yes. And I guess the other challenges would fall more under the customer’s application, right? If it’s worldwide, well, make sure you support different languages and your images don’t have text in only one language and things like that. Right? Yeah.
Nicole Hubbard 00:44:25 So a lot of those challenges fall onto our customers actually managing their sites. Since we don’t have any control over what’s actually on the sites, that’s entirely up to the customer.
Edaena Salinas What is needed to manage Kubernetes at scale? The first thing you mentioned earlier is to not do it directly on Kubernetes, but find tools like Helm and things like that. Is there anything else that you need to manage it at scale?
Nicole Hubbard Um, monitoring is extremely important. Uh, we leverage Prometheus within our Kubernetes clusters to be able to get metrics and data out and then generate alerts.
Edaena Salinas What sort of metrics do you get with Prometheus?
Nicole Hubbard So it’s able to talk to our applications and read any metrics that they’re exposing to it. So anything from average request time to slowest request to number of requests, anything along those lines.
Edaena Salinas So these are things that you specify in the application?
Nicole Hubbard 00:45:30 Yeah. You get to specify them for your applications.
Edaena Salinas Okay. And, well, I guess, can you explain a bit of what Prometheus is? Is it a wrapper around Kubernetes, or
Nicole Hubbard 00:45:55 So Prometheus is part of the CNCF. Um, it was the second project accepted by the CNCF, and it provides a tool that will go out and talk to your application and hit a metrics endpoint. So it scrapes the app and it saves all of those into a time series database that you’re then able to query.
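To make the scrape model concrete, here is a minimal sketch of what an application might return from its metrics endpoint, written in the Prometheus text exposition format (a `# TYPE` line followed by a `name value` sample). The metric names and values are hypothetical, and a real application would normally use a Prometheus client library rather than formatting the text by hand.

```python
from typing import Dict, Tuple


def render_metrics(metrics: Dict[str, Tuple[str, float]]) -> str:
    """Render metrics as Prometheus text exposition format.

    `metrics` maps a metric name to a (type, value) pair, where type is
    e.g. "counter" or "gauge". Prometheus would fetch this text on each
    scrape and store the samples in its time series database.
    """
    lines = []
    for name, (metric_type, value) in metrics.items():
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics similar to those mentioned in the conversation.
body = render_metrics({
    "http_requests_total": ("counter", 1200),
    "http_request_slowest_seconds": ("gauge", 0.25),
})
```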
Edaena Salinas 00:46:10 Okay. And do you use a separate tool to visualize this data?
Nicole Hubbard 00:46:25 Yeah, so we leverage Grafana to actually be able to query Prometheus.
Edaena Salinas 00:46:29 Yes. And you get to display graphs. Okay. Lots of pretty dashboards.
Nicole Hubbard 00:46:34 Yeah. I know.
Edaena Salinas 00:46:37 And how do you normally decide on which tools to use? Like, does it depend on what it costs? Is it open source, or
Nicole Hubbard 00:46:52 So we went the route of Grafana and Prometheus, specifically Prometheus, because it was able to handle the needs of the infrastructure that we have. We run into problems sometimes where some tools don’t handle the scale that we’re at, whereas Prometheus did not have that issue. We do like working with open source tools, but we’re also willing to leverage commercial offerings if they meet our needs.
Edaena Salinas 00:47:25 Okay. All right. Well, Nicole, thank you for coming on the show. It’s been great talking to you.
Nicole Hubbard 00:47:38 Thanks for having me.
[End of Audio]
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].