
SE Radio 636: Sriram Panyam on SaaS Control Planes

Sriram Panyam, CTO at DagKnows, discusses SaaS Control Planes with SE Radio host Brijesh Ammanath. The discussion starts off with the basics, examining what control planes are and why they’re important. Sriram then discusses reasons for building a control plane and the challenges in designing one. They explore design and architectural considerations when building a SaaS control plane, as well as the key differences between a control plane and a data plane.

This episode is sponsored by QA Wolf.




Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Brijesh Ammanath 00:00:51 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. I’m here today with Sriram Panyam to talk about SaaS control planes. Sriram is the CTO at DagKnows. Previously, Sriram has grown and supported multiple high-performing and deeply technical engineering teams at Google Cloud, LinkedIn, and several startups, both in the US and in Australia. Sri, welcome to Software Engineering Radio. Is there anything I missed in your intro that you’d like to add?

Sriram Panyam 00:01:19 Hey, thanks for having me here. No, you were spot on. I’m looking forward to chatting and sharing and learning.

Brijesh Ammanath 00:01:25 Let’s start with a brief definition of SaaS and its growing market importance.

Sriram Panyam 00:01:31 Yeah. So if you think about your favorite applications, especially in the last 20 years, you had the rise of this whole Web 2.0 movement. Actually, let’s go back even before that. You had your traditional enterprise applications. Companies would create something and deliver it to users, and users would use it, usually with long, long development and deployment cycles. It came with its own costs and nuances. And from circa 2005 onwards, there was the rise of the whole Web 2.0 movement, where applications would be developed in a more agile way and would be more consumer focused. And obviously, the web being the main delivery mechanism meant that companies could iterate faster, collect feedback faster, and delight their users in a much faster fashion. Now, I don’t work for Slack. I’m in no way affiliated with Slack, but I find Slack is a very good example of this.

Sriram Panyam 00:02:32 Your typical chatting applications, WhatsApp, Facebook Messenger, they’re your typical consumer applications. You have one instance as far as the user can see, one giant global instance. You would send messages, you would read messages, you would do other things in those applications. Now, enterprises felt there was a need for those applications within a more closed or bounded domain. How about just messaging within enterprises? How about messaging within a collection of enterprises or a collection of teams? So if you look at Slack, Slack is a classic enterprise SaaS offering or a B2B offering, which is really popular. And it forms a good example of how you differentiate SaaS and non-SaaS offerings. Now, a SaaS offering is really a business model. If you think about what it means to be SaaS, I think there are many definitions, but the key principle is it’s a business model and a delivery model that really is driven by what the business needs.

Sriram Panyam 00:03:40 Technology is common across most applications, but how it’s used is important. One key thing is that a lot of successful companies that offer SaaS products believe in the idea that they have to adapt to what the market needs, what the customers need, and what the competition is doing. So a lot of SaaS companies are trying out new pricing models, newer market segments, looking at new customer needs. Now, there’s also the need for onboarding to be frictionless. Yes, onboarding onto the older or traditional consumer applications was frictionless: you had your auth, your signup or login tied to a customer. But here, really, your customer is the enterprise. While you may not have full visibility into the enterprise’s individual end users, you want to make sure that enterprises themselves can onboard onto your application in the most frictionless way possible.

Sriram Panyam 00:04:44 So this is important. You can’t just say, hey, look, we’ll set up a few boxes with Slack running on a bunch of nodes in your data center manually each time. Can you imagine how long that would take? Can you imagine how long it would take to roll out fixes or deploy new features? So all this has to be frictionless. And, especially in the last 10 or so years, regulatory and compliance concerns have been a huge influence on how enterprises want to adopt your offering. In fact, there are so many regulatory requirements, like sovereign clouds and data residency, that demand that application, data, and compute all reside in a single geography. For example, again, I picked Slack as an example. Slack is owned by Salesforce, which is an American company. Yes, it is global, but it’s headquartered in America.

Sriram Panyam 00:05:42 A government organization in Germany might have strict demands that all instances of Slack are running physically in three or four locations in Germany. So you need to ensure that happens. And again, a lot of the innovation doesn’t just come from the user interface. Those are important things, and there are customer features that do get rolled out, but taking care of these kinds of compliance and enterprise business needs is a primary motivation for the innovation. And also the usage scale varies. I think WhatsApp, a comparable consumer offering to Slack, though not quite the same thing, has about a billion daily active users, each sending maybe a thousand or ten thousand messages a day, and it has to be globally available. I would have a WhatsApp instance; I would log into WhatsApp, for example, chatting with my family all the way in India or Australia.

Sriram Panyam 00:06:39 And they all have to be available at the same time. With something that’s more enterprise, like Slack or Slack’s enterprise offering, those particular global demands could be softened. I might require that my employees are all based in a single geography, so as long as they can communicate, I’m good. So these are some of the things that differentiate SaaS from your traditional consumer offerings, and they influence how you build your teams around this, how you build your stack, how you look at metrics, how you look at your product roadmapping, even your team culture; all of that is influenced. So that’s why SaaS offerings themselves, SaaS as a business model, is growing quite fast, and will be doing so for the foreseeable future. And these stats keep changing all the time, but an interesting stat I found was that in the US alone, the SaaS market is around half a trillion dollars annually. And globally, there are between 25 and 50 thousand SaaS companies offering various services to various enterprises.

Brijesh Ammanath 00:07:47 Interesting. Let’s move into the topic of the session, which is SaaS control planes. Can you give a definition of what a control plane is and why it’s important?

Sriram Panyam 00:07:58 Right. We started with Slack as a motivating example here, and you can think of this for almost any application that an enterprise needs. So what is a control plane? The terminology arose from the networking era. You had your data centers; these data centers would have switches. Switches would connect to N number of routers, and routers would offer a bunch of networks. The idea was you wanted some kind of connectivity from one part of the world, that’s the physical connectivity, going through some kind of logical networking to another part of the world. Now, at the start, these were all pretty much physically placed, physically created. My career started off as a network designer in Australia’s largest telecom, called Telstra.

Sriram Panyam 00:08:54 And my job was to design how to structure customer racks within a data center for their needs, and planning was a huge part of that. You would ask them what the applications were for, what the typical usage pattern of the application was, what kind of ingress and egress bandwidth needs they would have. And you would decide, okay, look, they’ll need X number of switches, Y number of routers; given this kind of isolation between their own topologies, they might need so-and-so number of networks. Now, this was, I think, the early 2000s. As the Web 2.0 movement took off, scale was growing by orders of magnitude, and I exaggerate, on a weekly basis.

Sriram Panyam 00:09:42 Doing this physically or manually was just not possible. Take, for example, Google, and this is just me doing back-of-the-envelope numbers. What happens inside Google is actually larger than what happens in all of the internet outside. To put it another way, Google’s internal traffic, among its thousands and hundreds of thousands of services, is greater than the amount of traffic that the rest of the internet sees. And that’s a staggering fact. So you can’t provision these networks manually. You have to have some way for these networks to be provisioned declaratively. So this whole idea of a highly connected cross-switching fabric came up. As a summary, what this gave you was the illusion of every node in any network in the world being connected to any other node almost directly.

Sriram Panyam 00:10:46 It wasn’t direct, obviously; it would be through a bunch of hops, but you would change this network topology using software, and that’s where this whole software-defined networking came in. And the thing that would change these routing rules, not necessarily on the fly on a second-by-second basis, but on a reasonable timeframe, that part of the stack was the control plane. So how does all this networking stuff apply to SaaS? I mean, we’re talking about something that’s eight layers above the networking stack. So what does the networking stack have to do with control planes and SaaS? Networking is layer one, two, maybe three; the application is four or five layers above that. Now, the idea is the same.

Sriram Panyam 00:11:31 If you look again at our favorite example, Slack. I think Slack has something like 50 million daily active users as of 2023, 2024; again, my numbers are rounded up. Now, Slack also has about half a million enterprises on it, 500,000 enterprises roughly. Even if you say that, look, most traffic on Slack is going to come from the top 1% of enterprises: 1% of 500K is 5,000 enterprises contributing to those 50 million daily active users. Again, these are just my back-of-the-envelope numbers that I’m massaging. So we’re looking at 5,000 enterprises contributing to 50 million daily active users. And if you define an active user as someone sending, let’s say, a thousand messages a day, we’re looking at 50 billion messages being sent a day.

Sriram Panyam 00:12:38 And that comes to about half a million messages per second. And again, using some very hand-wavy math, if you assume that every message you send is being read by 20 users across all these channels, then for half a million messages being created, you have about 10 million reads of those messages; that’s per second, by the way. And that’s a staggering number. To serve this, you’re looking at somewhere around 10,000 compute nodes with about 10 terabytes of memory, give or take. Now, what’s more interesting here is that you could say, look, it’s only 10,000 nodes; let’s just bring up a giant instance of Slack and be done with it. Now imagine 10,000 nodes serving 500,000 enterprises globally. That’s your classic shared model, where every enterprise is being served out of the same stack.
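To make the back-of-the-envelope math above concrete, here is a quick sketch in Python. The inputs are Sriram’s rough, illustrative numbers from the discussion, not real Slack figures:

```python
# Back-of-the-envelope estimate of Slack-scale message traffic,
# using the rough illustrative numbers from the discussion.

daily_active_users = 50_000_000      # ~50M DAU from the top ~5,000 enterprises
messages_per_user_per_day = 1_000    # assumed definition of an "active" user
reads_per_message = 20               # assumed read fan-out per message

messages_per_day = daily_active_users * messages_per_user_per_day  # 50 billion
messages_per_second = messages_per_day / 86_400                    # ~580K/s
reads_per_second = messages_per_second * reads_per_message         # ~11.6M/s

print(f"{messages_per_day:,} messages/day")
print(f"~{messages_per_second:,.0f} writes/s, ~{reads_per_second:,.0f} reads/s")
```

Roughly half a million writes and over ten million reads per second, matching the numbers quoted in the episode.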

Sriram Panyam 00:13:38 Where is the stack running? Is the stack running globally? Is it running in some data center in North America? Is it running in some random configuration? Now, we talked about how enterprises have these requirements on how they want their applications to be isolated, and isolation is the big, big motivation for what we are talking about. If it was a single application cluster that you create and deploy once, we wouldn’t need a control plane. What customers want is to be able to say, look, I want a stack. Imagine if you’re Uber. Uber says, I want a stack, my usage is predicted to be this, and I want to make sure that my availability is so-and-so. Which means that if I’m sharing a cluster with 499,000 other customers, then it’s pretty much an all-or-nothing availability mode.

Sriram Panyam 00:14:33 If that cluster goes down, every customer is affected. As we can see, that forms the motivation for why you want isolation. Now, going to the other extreme, if you say that, look, every customer gets their own separate cluster, then these 10,000 nodes are serving 5,000 customers, so two nodes per customer, rough hand-wavy math. Then the challenge is, how do you deploy these clusters when they’re needed? Again, going back to the old networking model: a new customer comes in, they want a dedicated network, go and install new switches and routers. That was great on day one, but now it’s just very cumbersome. So this is where the control plane comes in. The control plane is a piece of software, a part of the stack, that takes care of anything involved in handling a new customer that the Slack application itself is not directly responsible for.

Sriram Panyam 00:15:30 So what are some of those things? Uber comes in, they want to use Slack. How do you onboard them? Is there a console for them to onboard quickly without having to submit a request and wait a few weeks before the Slack team goes and provisions those machines and infrastructure manually? How do you handle any regional requirements? If Uber says, look, I really want to have everything in this region or these regions for so-and-so availability, are we expecting them to go and manage their own custom clusters on which Slack is installed? This could be Kubernetes or anything, but we don’t want that. Billing: we talked about 50 billion messages a day, and that’s not an even distribution of messages. If you’re charging somebody for the number of messages, you want to actually measure what that’s like.

Sriram Panyam 00:16:24 Or you might just charge for a footprint, and so on. Now, Slack might even say, look, we’ll actually help you manage your users’ identity and accounts and access, right? So there’s some overlap as to whether that belongs to the control plane or the data plane. By the way, the data plane is the application being provisioned or managed or deployed. I think in some places it’s also called the application plane. It is effectively the service that the end user sees. Now, what about things like any other specific tenant provisioning details that you want to abstract away? So this is the control plane. It is like any other service, but it helps build, deploy, and provision the different stacks and tenants for the end enterprise customer. That is one key definition to rally around. It has more nuances, like how it manages data, how you get to that ideal state, where you start from, and so on. But you can think of the control plane as the service or the plane that manages the lifecycle and availability of the data plane.

Brijesh Ammanath 00:17:41 So just to summarize, you started off by giving a brief history of how, in data centers with switches and routers, the complexity was managed using software, and that kind of led to the creation of a control plane, which is primarily there to manage provisioning, configuration, user management, charging, regional deployments, and so on for the data planes or the applications. Is that a good summary?

Sriram Panyam 00:18:08 Yeah. So the idea of control planes came from the networking world. How you manage those tenant-specific, non-end-user-specific concerns is what the control plane’s about.

Brijesh Ammanath 00:18:19 Can you tell me a story of how a control plane helped manage complexity?

Sriram Panyam 00:18:25 I think I started on some parts of that in the previous question. So think about what you would need to deploy Slack for its customers, and I can talk about some internal examples too. The reason I use Slack is because it’s a very relatable example that people just get. Well, first of all, let’s look at some of the core concerns that a control plane should really take care of. There are many, but let’s start with metrics. How do you surface usage metrics from the underlying service, both to the administrators of that service, let’s say Slack, as well as to the developers of the service? The control plane needs to be able to identify that, look, this instance is being used in these ways, and here are all the rich metrics that can be captured to shine light on how different tenants are using the system.

Sriram Panyam 00:19:22 Now, you as a service developer can use that metric data to improve various parts of your actual data plane offering. The other one is, how are you establishing the lifecycle of tenants, not just creation? You want to have the CRUD operations on tenants: create, retrieve (or get), update, and delete. When you onboard a new tenant like Uber or Apple onto Slack, what do you set up for them before they can start using Slack? That might take into account all their compliance rules. In fact, an organization might actually have multiple tenants. For example, someone like Apple might say, and again this is not based on any particular example, just general observations around different SaaS deployments, look, for my AI team, I will need this entire Slack instance for this set of users who are primarily in North America.

Sriram Panyam 00:20:28 That’s one tenant within Apple. Or they might say, one tenant is here, and a second tenant could be in Europe only for the legal area. Now, Slack might think of Apple as one customer or one account, but you might decide that allowing multiple tenants to exist under that one customer account is paramount for you. So now your control plane needs the notion of: what is a tenant? What is an account? What is an installation? What is a deployment? Now that you’ve created these tenants, they might say, look, I have different kinds of onboarding. I would like to onboard my own users, let’s say Sriram@apple.com or Brijesh@apple.com, using my internal employee IDs. Now, how can I tie up the authentication of those users, let’s say it’s based on OAuth or 2FA and so on, before they log into Slack?
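The account/tenant model described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration of the CRUD operations a control plane exposes on tenants; the class names, fields, and example tenants are made up for this sketch, not taken from any real product:

```python
from dataclasses import dataclass, field

@dataclass
class Tenant:
    name: str
    region: str
    isolation: str  # e.g. "shared" or "dedicated"

@dataclass
class Account:
    """One customer account (e.g. "Apple") can own multiple tenants."""
    customer: str
    tenants: dict = field(default_factory=dict)

    # The CRUD operations a control plane exposes on tenants:
    def create_tenant(self, name: str, region: str, isolation: str) -> Tenant:
        tenant = Tenant(name, region, isolation)
        self.tenants[name] = tenant
        return tenant

    def get_tenant(self, name: str) -> Tenant:
        return self.tenants[name]

    def update_tenant(self, name: str, **changes) -> Tenant:
        tenant = self.tenants[name]
        for key, value in changes.items():
            setattr(tenant, key, value)
        return tenant

    def delete_tenant(self, name: str) -> None:
        del self.tenants[name]

# One account, two tenants with different regions, as in the Apple example:
account = Account("Apple")
account.create_tenant("ai-team", region="us", isolation="dedicated")
account.create_tenant("legal", region="eu", isolation="dedicated")
```

A real control plane would persist this state and drive provisioning from it, but the account-versus-tenant distinction is the same.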

Sriram Panyam 00:21:19 Now, Slack as a service might give you those features for enabling different kinds of authentication, but you still have to provision different data stores so that you store that information in compliance with what Apple needs. And that could mean Apple gets their own dedicated database of user accounts, whereas somebody who’s a smaller startup with 10 customers might be okay without those strict isolation requirements. So when you onboard them, you might say, look, I’ll have 10 instances or 10 different tenants running on the same internal Kubernetes cluster where I’m deploying Slack. So this kind of managing of onboarding and resources for those onboarded tenants is key. Now, an admin user interface can be two different things here. One is for Slack, the company: you might have an interface to monitor and observe the different tenant installations.

Sriram Panyam 00:22:16 It could also be an admin interface for the tenant administrator. So somebody at Apple, or somebody at, let’s say, DagKnows, might be the administrator for their respective account, doing things like logging, looking at operational behaviors, and being able to manage that environment. If they want to upscale, what does that mean? Upscaling could mean, hey, look, I expect that instead of 10 users I’m going to have a thousand users. So I’m saying that, and now Slack, you go and take care of provisioning without me caring about those details. So now the Slack control plane will say, look, this user is going from a very small instance of 10 users to a large instance of a thousand users. Maybe they got funding, they got acquired, and so on.

Sriram Panyam 00:23:04 Now, I need to make sure that I move that instance from a shared host to its own, for example, Kubernetes cluster, and the Slack control plane is responsible for doing all that without the end user noticing that this is happening. So now it has to manage these kinds of updates, the update part of the lifecycle. And the other important thing that we talked about is identity and authentication. How do you make it so that the end user does not have to manage these accounts manually, but can use your offered features as part of the control plane to have seamless onboarding within onboarding? What I mean by that is, there’s first the enterprise onboarding at the Apple or Uber level, and then the individual customer, the individual employee or user onboarding. Last but not least, I think billing is a key thing.

Sriram Panyam 00:23:57 Ultimately, you’re in business because you want to turn a profit, or you have certain growth or financial goals that you want to meet. Without loss of generality, let’s say you want to make money, and a large part of billing is knowing how you are charging your customers on some metric. It could be based on subscriptions; it could be based on usage. And you want this billing to be fair and transparent. If you go back to that v0.0.1 where we said, hey, we have 10,000 nodes running Slack and every Slack enterprise customer is part of that shared cluster: how do you know which customer had how much usage, so that you can bill them fairly? So billing being robust, consistent, and available is important. These are the core features that a control plane should be responsible for. Now, you can do this in different ways. You can do this through a siloed approach, a shared approach, or a completely isolated approach, both at the data level and the service level, and they have different implications. And we can talk more about that.
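The usage-metering idea behind fair billing can be sketched briefly: attribute every message event to a tenant so the bill can be computed per tenant even inside a shared cluster. The events and the per-message rate here are invented for illustration:

```python
from collections import Counter

# Each event carries the tenant it belongs to; in a shared cluster this
# attribution is what makes per-tenant billing possible at all.
events = [
    {"tenant": "uber", "type": "message"},
    {"tenant": "uber", "type": "message"},
    {"tenant": "acme", "type": "message"},
]

PRICE_PER_MESSAGE = 0.0001  # assumed usage-based rate, purely illustrative

usage = Counter(event["tenant"] for event in events)
bills = {tenant: count * PRICE_PER_MESSAGE for tenant, count in usage.items()}
```

A real metering pipeline would aggregate billions of such events asynchronously, but the principle, tag every unit of usage with a tenant and aggregate, is the same.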

Brijesh Ammanath 00:25:15 You talked about data planes. Just wanted to understand, have you come across any instance where the control plane and data plane were not separated out? And how did that evolve over time? Did it need to be separated out as the application matured?

Sriram Panyam 00:25:31 No, this is a great question. Most SaaS offerings start off as a single, combined control plane and data plane offering. What I mean by that is, let’s go back to Slack. Slack on its day one, and again this is not definitive, but any offering like this, would’ve looked like a giant database where you might have a few tables, like a users table, a chats table, a messages table, and each of these tables would have a dedicated column called tenant ID. You might say, for this tenant or this enterprise user, get me all chats where the tenant ID is this. Now, what happens here is that you have a single shared table, and it’s up to the service itself to write the rules, to layer its business logic, to route across different tenants.
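The day-one shared-schema model described above looks roughly like this in code: one messages table shared by all tenants, with a tenant ID column, and the service layer scoping every query to one tenant. The schema and data are illustrative, not real Slack internals:

```python
import sqlite3

# One shared table for all tenants, distinguished only by tenant_id.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (tenant_id TEXT, sender TEXT, body TEXT)")
db.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("uber", "alice", "hello"),
        ("uber", "bob", "hi"),
        ("acme", "carol", "hey"),
    ],
)

def messages_for_tenant(tenant_id: str) -> list:
    # The service is responsible for filtering on tenant_id in every query;
    # forgetting this filter anywhere is exactly the cross-tenant leak risk.
    return db.execute(
        "SELECT sender, body FROM messages WHERE tenant_id = ?", (tenant_id,)
    ).fetchall()
```

This keeps the startup focused on business logic, at the cost of every query path carrying the tenant-isolation burden.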

Sriram Panyam 00:26:28 And when you’re a new startup, this makes sense, because you want to focus more on your business logic. You really don’t want to invest in a separate control plane team to handle these different customers. And part of that is also the business motivation, because you would start off with smaller customers who are okay being in this model. If a startup on day one acquired a large customer, then isolation would be the focus. Then you have your next step: instead of putting everything in a single database with a single schema, you might say, look, I have my chats table, my messages table, my users table; let me create a different database or a different schema for each tenant. So instead of having a messages table, I’ll have uber_messages or messages_uber as my table.

Sriram Panyam 00:27:21 Or I might even have a database called the Uber database, which will have these three different tables in there. So at the code level, you might say, look, as soon as I get a request, I will look at which tenant that user belongs to, let’s say using something like OAuth to identify what that domain is, and so on. And you might say, every action from now on will go to this database. So my code is simplified, because I don’t have to choose between databases on every operation I make; it happens at the starting point. Again, this is great because you’re still sharing resources. You don’t have to worry about provisioning concerns. The only provisioning concern here is, can I create those three different tables in that customer-specific database in my DB cluster?
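The tenant-routing step just described can be sketched as follows: resolve the tenant once per request, for example from the authenticated user’s email domain, and send every subsequent query to that tenant’s database. The DSNs and the domain mapping here are made-up placeholders:

```python
# Hypothetical mapping from tenant domain to a dedicated database.
TENANT_DATABASES = {
    "uber.com": "postgres://db-cluster/uber_db",
    "apple.com": "postgres://db-cluster/apple_db",
}
# Smaller tenants without isolation requirements fall back to a shared DB.
SHARED_DATABASE = "postgres://db-cluster/shared_db"

def database_for_user(email: str) -> str:
    """Pick the database once, at the start of the request."""
    domain = email.split("@", 1)[1]
    return TENANT_DATABASES.get(domain, SHARED_DATABASE)
```

After this one routing decision, the rest of the code path can be tenant-unaware, which is exactly the simplification described above.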

Sriram Panyam 00:28:11 And this will go on for a while; this is fine. The downside is that, again, it is shared, so if that database cluster goes down, all the customers go down. Now, as you evolve, as you have customers with higher isolation requirements, you’ll start looking at, okay, how can I ensure that each customer gets their own tenant? Which means that within that tenant, within that service stack deployment, the code looks at the entire stack as a single tenant. It is not aware of multiple tenants, because why would it be? When you have a single stack that is isolated and dedicated to one customer, that’s all it needs to focus on. Now, here’s where you start thinking about control plane concerns.

Sriram Panyam 00:28:54 Because as the number of customers grows, you don’t want to manage these stacks manually. You don’t want to operate them manually, one by one. You want to do it in an automated fashion. So this is a typical evolution: from everything in a single namespace or a single shared environment for all customers, to something in between, a hybrid approach where some customers are routed based on schema and some customers get their own dedicated clusters while it’s manageable, all the way to a fully siloed approach where every customer is either packed into a shared cluster based on their tier, or gets their own dedicated cluster based on their tier, their requirements, and obviously their revenue potential too. So, yeah, this is the typical evolution from day-one SaaS with a built-in control plane, all the way to a dedicated control plane team or organization that supports the different products a company might offer.
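The tier-based placement decision at the end of that evolution can be sketched as a small policy function. The tier names, the user threshold, and the placement labels are illustrative assumptions, not any real product’s policy:

```python
def place_tenant(tier: str, expected_users: int) -> str:
    """Decide where a new tenant's stack should run, based on tier and size."""
    if tier == "enterprise" or expected_users > 500:
        return "dedicated-cluster"            # own stack, strongest isolation
    if tier == "business":
        return "shared-cluster/own-schema"    # shared compute, own namespace
    return "shared-cluster/shared-schema"     # day-one pooled model
```

In practice this policy would also weigh region, compliance requirements, and revenue, but the control plane’s job is the same: turn a tenant’s tier and requirements into an automated placement, not a manual one.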

Brijesh Ammanath 00:29:52 Thanks. We’ll now move to the next section, which is more around designing the SaaS control plane. Can we start off by walking through how data movement happens in a typical SaaS setup? And what are the points where the control plane helps with that data movement?

Sriram Panyam 00:30:12 Let’s see. We covered a few things before in terms of isolation. So let’s look first of all at how we want to think about storage and data, for both the control plane services as well as the data plane. We spoke about different partitioning models. On day one, you have everything in a single database, single data store, single data cluster, or data namespace, and the software is responsible for deciding which table or even which row to pick based on the tenant ID. As you evolve to the next level of partitioning, the software has a top-level routing of which database or which namespace to pick. And after that, you can think about a dedicated database connection that is only for a single database or a single schema being handled by the underlying code.

Sriram Panyam 00:31:04 So in a way, the code is not really fully tenant aware, but it uses the different database instances. And then, going to the full extreme, we are talking about every customer getting their own data cluster, data namespace, or database. Now, each of these storage partitioning schemes, or routing schemes, has its own approach to how it can manage data migrations. If you look at the fully independent, isolated model, the control plane can help migrate data on a per-tenant basis, because it is either moving an entire database or an entire database cluster from one location to another. In the middle case, where we said I will assign a unique namespace to every customer, replicating that or moving that out is a relatively easier proposition. Imagine instead having to filter a single database by tenant ID every time you have to.

Sriram Panyam 00:32:05 That means that you are incurring a load on a single database. Doing this in a siloed approach means that you can do a continuous backup of your data or your database for that tenant, and simply restart or load from that backup in the event of a handover, a failure, or a transition from leader to follower. So the thing is, whichever strategy you pick, the control plane has to have a certain set of rules on what kind of automation is running, to ensure that these replication, backup, and restart procedures are taken care of. Data replication is part of this; disaster recovery is part of this. So this also affects your RPO and RTO targets, and obviously all of that is impacted by the cost that the customer is willing to incur.
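One way to picture the RPO/RTO point above is as per-tier recovery targets the control plane’s backup automation derives its schedule from. The tiers and numbers below are illustrative assumptions, not a real SLA:

```python
# Hypothetical per-tier recovery targets (minutes). RPO bounds how much
# data you may lose; RTO bounds how long recovery may take.
RECOVERY_TARGETS = {
    "shared":    {"rpo_minutes": 60, "rto_minutes": 240},
    "dedicated": {"rpo_minutes": 5,  "rto_minutes": 30},
}

def backup_interval_minutes(tier: str) -> int:
    # Backups must run at least as often as the promised RPO,
    # or data loss on failover could exceed the target.
    return RECOVERY_TARGETS[tier]["rpo_minutes"]
```

Tighter targets for dedicated tenants cost more to run, which is exactly the cost trade-off Sriram mentions.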

Sriram Panyam 00:33:03 The other aspect of data migration and data movement is security considerations. Obviously, when you have all the data in a single cluster, in the day-one scenario, you need extra security processes, at the business logic level, at the access level, in all parts of your stack, to ensure that you don’t have data being leaked across tenants. It gets easier as you go up the isolation strategy stack. In the case of multiple databases, or multiple namespaces in the same database, it’s a bit easier. In the case of dedicated clusters or dedicated tenants, it’s a lot easier to ensure that kind of security guarantee. The other part of data management is also billing, and how you ensure the ROI, I suppose.

Sriram Panyam 00:33:59 When you have a single cluster where all tenants are hosted, you are saying that the same kind of instances will be given to everybody. Whereas here, you have an opportunity for much more fine-grained control over the kind of instances you give customers. Customers who are willing to pay more can enjoy better instances or better clusters. Customers who are okay with lower levels of isolation and lower SLOs can stay on the shared tiers until needed. So, yeah, the control plane gets more and more robust and more and more complicated, because it has to manage this data movement across tiers, across security boundaries, across isolation boundaries, across regional constraints, and has to do so in a changing environment. This demand won’t change on a regular basis, but when it does, the control plane has to act with minimal downtime, minimal manual intervention, and as quick a turnaround as possible.

Brijesh Ammanath 00:35:10 All right. Can you talk about some interesting architectural decision points and common patterns used in designing a control plane?

Sriram Panyam 00:35:20 So one thing I can share: we talked about the example of a very large company wanting multiple tenants for their own architecture. Now, if you look at the three models we spoke about so far, we said, look, on Day 1 a SaaS offering has everything bundled. On Day 5, or somewhere in between, it starts to split out the data or some parts of these services into their own namespace. And then you have completely dedicated offerings for each customer. If you were to go the extra step, you can think of this as a control plane of control planes architecture. Now, imagine a very large company wanting their own isolated tenants on their own premises. Now these premises could be actual data centers, or they could be custom cloud accounts, either customer accounts on AWS or organizations on Azure and so on.

Sriram Panyam 00:36:20 If you look at some of the large-scale data processing platforms, for example Dataflow, it would provision an entire working stack, or a large part of the working stack, on the customer's account. That means bringing up the compute instances, the storage nodes, the GPU instances and so on in the customer's service account and running the jobs there. So there is the control plane that orchestrates their instance, and then within that you have a control plane which is responsible for orchestrating things locally. So this architecture, where you have your initial control plane that deploys another control plane on the customer premises, is pretty interesting, because you're really talking about another level of isolation and another level of control the customer can benefit from. This obviously adds complexity.

Sriram Panyam 00:37:17 Because in the true SaaS model, you're provisioning the customer's offering in an environment that you're familiar with. The moment you have to transcend that and go to a different environment, it obviously adds more scope for failures, more challenges in terms of availability, and more challenges in terms of being able to observe, monitor, and debug what's happening on the tenant side. This idea of having a control plane of control planes is actually a very interesting design choice. Now, obviously you wouldn't do that from Day 1; it's reserved for the ultra-sensitive customers who have those strict isolation requirements even beyond what you want to provide on your own.

Brijesh Ammanath 00:38:04 Can you tell me about any instance or any stories where something has gone wrong and how was it detected and then resolved?

Sriram Panyam 00:38:14 So at DagKnows, a large part of our footprint is around provisioning our software, our offering, directly on the customer premises. So we do follow a control plane of control planes model, but at a much smaller scale. Now, the big challenge here is that, depending on the customer, they might have security regulations and requirements where they may not be able to share observability data and metrics back to us. At DagKnows, we offer tools for running automations for the customers in a much more frictionless way. So when we offer a shared or even a managed offering of DagKnows, it's easy to debug because we know what's going wrong. When customers observe any failures, we can trace through our typical observability stack. Now, when things are going wrong on their premises, it gets challenging.

Sriram Panyam 00:39:19 So what we have done is we've actually enabled instrumentation. I mean, we enabled observability stacks on those offerings as well. But because of challenges in having them export that to us, we made it so that we can only get the observability data from them when and how they choose to send it. So the downside of this is that when failures happen, they will be the first to be alerted. This requires them to have their own observability teams, or at least a small observability team, to be on standby when failures happen, and we train them so that they can triage these incidents and escalate to us or reach out to us after a certain tier. Now what we've done is we've made it simple for them to share these metrics with us on a dialable basis.

Sriram Panyam 00:40:17 So, I mean, they can choose how much they want to share with us, but some customers are more particular about logs because they may hold sensitive information. Some customers are okay with sending everything. So we found that just by sending us traces and metrics, we are able to help them faster in a more secure way. If customers are okay sending everything, even better, obviously. When they share less, even though they have the choice to share more, they have a higher time to resolution. But that's as expected from this architecture. So the key here is that we've added instrumentation both in the control plane and in the data plane, or the application plane, so that this instrumentation can be filtered on both sides, both on the customer side as well as on our side.

Sriram Panyam 00:41:06 So they have some guarantee that they aren't leaking too many things to us, or that they aren't leaking things to us that they wouldn't want to. And obviously, customers that are okay with this can dial this all the way up and have much faster resolution and detection, because we are now privy to the patterns of usage and errors on their side. So the control plane having this variability in how it provisions and what it provisions on the customer stack, and being able to upgrade that, again with the full control of the customer, is a very important choice that helps us.
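The customer-controlled telemetry "dial" Sriram describes could look something like the sketch below. The signal names and the shape of the policy are assumptions for illustration, not the actual DagKnows mechanism:

```python
# The customer's policy decides which telemetry signal types leave their
# premises; anything not explicitly opted into is filtered out on their side.
ALLOWED_SIGNALS = ("metrics", "traces", "logs")

def export_payload(telemetry: dict, customer_policy: set) -> dict:
    """Keep only the signal types the customer has opted to share.
    Logs are typically the most sensitive, so many customers omit them."""
    return {signal: data for signal, data in telemetry.items()
            if signal in customer_policy and signal in ALLOWED_SIGNALS}

# A customer sharing only metrics and traces, never logs:
shared = export_payload(
    {"metrics": ["cpu=80%"], "traces": ["span-abc"], "logs": ["user login failed"]},
    customer_policy={"metrics", "traces"},
)
```

Because the filter runs on the customer's side of the boundary, the vendor can only ever see what the customer has dialed up, which is the guarantee being described.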

Brijesh Ammanath 00:41:42 Do you have, or do you remember any observation or any data shared by the customer which surprised you? What were the findings?

Sriram Panyam 00:41:51 Well, I can't share everything, but there are always surprises. There are always surprises that turn out to be not surprising once you get to the bottom of them. We've had many customers that would see a failure and, depending on how much they're exporting to us, we would have visibility into what's causing it. To keep it at a very general level, I can share this: one of our customers was using one of the control plane data stores for their own data plane logging. It wasn't so much a bug as a design choice, I guess. And this obviously affected their billing, because when we billed them, the billing was based on usage and not necessarily things like storage metrics. Now, obviously, when storage was ballooning because of this workaround or flaw, we found a way to mitigate that at that point in time. But it also helped us learn how we can address the issue of billing upfront, and what kind of metering has to be in place to catch all the metrics so that, again, we can provide a fair price to our customers. Again, this is a very specific example of data plane storage leaking onto our control plane, which we were able to identify by observing how they were using it.

Brijesh Ammanath 00:43:13 Are the architectural approaches different for control planes and multi-tenant solutions?

Sriram Panyam 00:43:19 Are the architectural approaches different for control planes in multi-tenant solutions? In a way, you are creating a control plane to make multi-tenancy easy. Now, we talked about different kinds of multi-tenancy, from Day 1 to Day 5 to Day 100. Even at a logical level, the single cluster or single physical environment with all your customers, all your tenants in there, if you think about it, is multi-tenant. Now, the isolation is what has changed. As the offering grows, as the shape of the offering grows, as the scale grows, your control plane is evolving in where it is deploying this logical entity. Now, whether it's deploying yet another table or yet another tenant ID in a single database that your single stack can use, versus yet another physical cluster to be used by a tenant, all the way to a dedicated control plane on the customer's premises, your control plane is going to change.

Sriram Panyam 00:44:25 In fact, your control plane storage itself is going to evolve. You might start putting more and more things in the control plane storage, so there are different availability guarantees. In fact, you want your control plane to be highly consistent. If you think about the CRUD operations on a control plane, they will map to the CRUD operations on the lifecycle of your tenant. Going back to Slack: there are 50 billion Slack messages a day, but there are only, what, 500,000 Slack enterprise accounts. Even if Slack was growing, let's say, 100% year over year, you might add 500,000 more Slack enterprise accounts next year. But that is still a tiny, tiny drop compared to how many messages are being sent on Slack.

Sriram Panyam 00:45:21 So it's okay for your Slack control plane to have a higher latency, but it needs to have higher availability. That obviously affects the choices in how you design it, what kind of storage you'd use, and, when you write to the storage, what kind of transactionality you might want to impose at the expense of latency. So yes, your design choices do change. Your control plane actually does change. But you have to remember, the control plane itself is much lower in footprint than your data plane, and it has to be. You want to ensure that you're powering a scale that is orders of magnitude more than what the control plane itself would see. In fact, you want your control plane to be built in such a way that even if your control plane goes down, your data plane continues to operate.

Sriram Panyam 00:46:11 Yes, you might not be able to create a new tenant, but your existing tenants are still operating. You might not be able to delete a tenant; okay, that's fine. You might not be able to change the shape of a tenant temporarily while the control plane is being brought up again. But your data plane has to be operating at a much higher level of availability, because that's what the end user is going to see. So ultimately your control plane has to enable multi-tenancy. That journey from Day 1, where everything is in one place, to Day X, where you have control planes of control planes or some hierarchy of that, is an interesting journey.
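The decoupling Sriram is describing, where control plane CRUD maps to the tenant lifecycle while the data plane keeps serving even during a control plane outage, can be sketched as a toy model. The class names and caching scheme are illustrative assumptions only:

```python
class ControlPlane:
    """Toy model: control plane CRUD maps to the tenant lifecycle.
    It is low-volume and must be consistent, but may briefly go down."""
    def __init__(self):
        self.tenants = {}
        self.available = True

    def create_tenant(self, tenant_id, shape):
        if not self.available:
            raise RuntimeError("control plane down: cannot create tenants")
        self.tenants[tenant_id] = shape

class DataPlane:
    """Serves end-user traffic from a cached copy of the routing table,
    so a request does not depend on the control plane being up."""
    def __init__(self, control_plane):
        self.routing = dict(control_plane.tenants)

    def handle(self, tenant_id, message):
        if tenant_id not in self.routing:
            raise KeyError("unknown tenant")
        return f"delivered {message!r} for {tenant_id}"
```

With this shape, a control plane outage blocks tenant creation and reshaping but existing tenants keep sending messages, which is exactly the availability asymmetry described above.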

Brijesh Ammanath 00:46:54 What are the disaster recovery considerations that we need to consider when designing the control plane?

Sriram Panyam 00:47:01 We touched briefly on this in the data movement and migration aspects. If you think about a control plane as any other service (after all, it is a service; it's a service that's managing the lifecycle of other services), a control plane is going to have its own disaster recovery mechanisms, because it's going to have its own storage and data that it has to protect. For example, control plane storage might keep track of what the application positioning or placement is in different regions for a particular tenant. Apple, for example, might have five tenants that have N number of clusters in 25 different regions, maybe spread out across the three major clouds. So recording all this is a key responsibility, among many others, of the control plane. And we spoke about how it needs to have high consistency and high availability at the expense of latency.

Sriram Panyam 00:48:01 It can trade off latency for availability and consistency. So just like any other service, you might choose how you do disaster recovery by picking one or more secondary regions where you're doing either real-time or RPO/RTO-based replication. You might be okay if, for example, a tenant says, I'm okay with not being able to reshape my Slack instances for three hours. And that kind of forms your soft RTO, or recovery time objective. So the ideas you would pick for disaster recovery would be similar to any other service. Now, the data plane may have its own disaster recovery requirements. For example, the data plane, or Apple in this example, might say, I want all my messages to be backed up and replicated in three different regions on three different continents.

Sriram Panyam 00:49:04 Now you can leave it all to the service to handle, or you could provide certain pluggable areas in your data plane that can communicate with the control plane to make this happen. So how the different regions for DR on the data plane are set up could also be part of your control plane's concern. So TL;DR: the control plane is a service. It'll have its own disaster recovery mechanism, but it can also help the data plane with some of those concerns on placement, on RTO/RPO, on setting up the different environments for failovers, and so on. So DR has a lot of similarities and a lot of differences in what it means for a control plane, but if you think of it as yet another service, it makes the design choices more familiar.
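One way to picture a control plane helping the data plane with DR is a mapping from a tenant's RPO/RTO targets to a replication setup. The thresholds and mode names below are hypothetical illustrations of the trade-off, not a prescription:

```python
def replication_strategy(rpo_minutes: int, rto_minutes: int) -> dict:
    """Sketch: map a tenant's recovery targets to a replication mode.
    Tighter RPO means more synchronous (and higher-latency) replication;
    tighter RTO means more standby regions kept warm."""
    if rpo_minutes == 0:
        mode = "synchronous"        # every write confirmed in a secondary region
    elif rpo_minutes <= 15:
        mode = "async-streaming"    # continuous log shipping, small loss window
    else:
        mode = "periodic-snapshot"  # scheduled backups are enough
    secondaries = 2 if rto_minutes <= 60 else 1
    return {"mode": mode, "secondary_regions": secondaries}
```

A tenant that tolerates three hours without reshaping, as in the example above, would land on the cheapest row; a tenant demanding replication across three continents would drive the synchronous, multi-secondary end of the table.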

Brijesh Ammanath 00:49:54 Thinking along similar lines, what about security considerations for the control plane?

Sriram Panyam 00:50:01 Security considerations for the control plane. Again, we can talk about the similarities if you were to think of it as yet another service. But one thing to understand is that many people, when they think about isolation, fall back to authentication and authorization. This is not a wrong thing when you are in Day 1 and everything is in a single physical environment, because we talked about how the service layer is doing the routing at the table level by looking at a WHERE clause on the tenant. But again, there is very little isolation here beyond some piece of code knowing which entries to fetch in a table. But as you go up that scale, from everything shared to everything being in a hierarchy of control planes of control planes, we are talking about how the control plane enables plugging in custom and diverse access management controls.

Sriram Panyam 00:51:06 Do you want access management to be tied purely to OAuth, where you would log in through your Google account, and if you have Sri@Apple and [email protected], is that enough? Versus, I don't even want Sri@Apple to be anywhere near a certain blast radius vicinity of [email protected]. So again, you can leave all this to the data plane; you can say, hey, data plane, you manage which authentication domains to connect to. But the fact that the data plane is even letting you choose between authentication domains could in itself be a major security red flag, or at least a security concern as far as many compliance requirements are concerned. So you might want to say that this stack, or this setup, or this deployment has to be completely unaware of any other deployment anywhere else.

Sriram Panyam 00:52:06 Which means that this deployment's access management hooks into Azure, while that deployment's access management hooks into AWS's IAM facilities, has to be managed, and the control plane is what can do that. And we can extend this example to control planes of control planes, where you might say that control plane subset X only has access to help you provision on Azure, while control plane subset Y only lets you provision your deployments on GCP, and so on. So again, you can expand the scope of the control plane, but it becomes a feature of the control plane now, like a feature of any other service: to give you the fine-grained isolation of the various access and authorization primitives depending on what the regulations and customer needs are. TL;DR, it's a feature, but the devil's in the details.
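The per-deployment identity-provider pinning Sriram describes might be sketched like this. The deployment names, provider labels, and lookup table are all hypothetical; real systems would verify signed tokens rather than compare strings:

```python
# Each deployment is pinned by the control plane to exactly one identity
# provider, so a deployment wired to Azure AD can never accept credentials
# issued for a deployment wired to AWS IAM.
DEPLOYMENT_IDP = {
    "apple-prod": "azure-ad",
    "netflix-prod": "aws-iam",
}

def authenticate(deployment: str, token_issuer: str) -> bool:
    """Reject any credential whose issuer is not the deployment's pinned IdP.
    An unknown deployment matches nothing and is always rejected."""
    return DEPLOYMENT_IDP.get(deployment) == token_issuer
```

The key design point is that the mapping lives in the control plane, so one deployment cannot even express a choice of another tenant's authentication domain.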

Brijesh Ammanath 00:53:03 What’s the role of Kubernetes in the design of control planes?

Sriram Panyam 00:53:08 So Kubernetes, and I'm speaking not as an expert, lets you create clusters at scale, with ease. It's a very simplistic definition. Now, your clusters could be regional, your clusters could be zonal, your clusters could be in different isolation boundaries that you are willing to pay for. The main idea is that it takes away the hassle of elasticity. It takes away the hassle of moving your workloads within a cluster. It takes away the hassle of all the provisioning that was much harder and finickier before. It also comes with a lot of challenges. It's obviously a very battle-hardened piece of infrastructure, but it demands a whole bunch of skill sets, and it's obviously complicated. But for all that complexity, you're able to enjoy elasticity that you don't have to manage yourself.

Sriram Panyam 00:54:10 Before this, even with VMs, you had to go and manage it. You had to observe it, you had to build up your auto-scaling groups, you had to take care of a lot of the provisioning, deployment, and rollout facilities that Kubernetes gives you out of the box. So if you think about how I would use Kubernetes to deploy either a control plane or a stack or a deployment: if you go back to Day 1, where everything was in a single service, a Kubernetes cluster would actually, first of all, be overkill. You're using Kubernetes to provision a small set of very related resources within a very tight boundary.

Sriram Panyam 00:54:59 Whereas now, with managed Kubernetes offerings like EKS, GKE, and AKS, on AWS, GCP, and Azure respectively, you can create clusters on demand. You can provision your entire stack on them on demand. So the control plane's role now would be to provision these clusters with certain limits, certain resource requirements, and constraints as a customer sees fit. These clusters could also be running on the enterprise customer's premises. So Kubernetes makes all this easy because it's a very unified way of having resources and compute at scale with elasticity. So it makes the CRUD aspects, the create, update, and delete aspects, much easier in your control plane. There's obviously a lot more to what goes on in a deployment than just resources in a cluster, but it's a great way to start off with the resources that you might need without having to incur provisioning delays and manual provisioning complexity.
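One common pattern for a control plane driving Kubernetes is to render per-tenant manifests, typically a namespace plus a resource quota matching what the tenant pays for, and apply them to a managed cluster. The sketch below just builds the manifest objects; the naming convention and quota fields are illustrative assumptions:

```python
# Render a per-tenant Namespace plus ResourceQuota that a control plane
# would apply (e.g., via kubectl or a Kubernetes client) to a cluster.
def tenant_manifests(tenant_id: str, cpu_limit: str, mem_limit: str) -> list:
    ns = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": f"tenant-{tenant_id}"},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota",
                     "namespace": f"tenant-{tenant_id}"},
        # Cap the tenant's footprint to what their tier entitles them to.
        "spec": {"hard": {"limits.cpu": cpu_limit,
                          "limits.memory": mem_limit}},
    }
    return [ns, quota]
```

A namespace-per-tenant layout corresponds to the middle isolation tier discussed earlier; a dedicated-cluster tier would instead have the control plane call the cloud provider's cluster-creation API before applying the same manifests.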

Brijesh Ammanath 00:56:06 Yep. Got it. Let’s talk about some of the future directions in this space. What emerging technology do you see in this control plane space?

Sriram Panyam 00:56:16 So we spoke about control plane of control plane architecture. The idea really is how do you move the control plane responsibility or control plane benefits, or even its administration closer to the customer?

Brijesh Ammanath 00:56:30 Can you tell us about any success stories that stand out in your mind about using control planes?

Sriram Panyam 00:56:37 Yeah. So Dataflow is a really great example. Dataflow is Google's data processing platform. It's actually built on top of an internal platform called Flume, and Flume traces its roots back to the original MapReduce ideas. Dataflow and Flume are both unified batch and streaming data processing platforms. Now, Dataflow itself is a highly scalable, highly available data processing platform. It processes, I believe, something on the order of tens of exabytes of data across thousands of jobs a day. And again, in very high-level numbers, its own footprint is on the order of tens of thousands of nodes across the many jobs that it runs, and its memory footprint runs into petabytes. And this is powered by a very efficient, very scalable control plane that ensures that customers' jobs actually run on customers' accounts.

Sriram Panyam 00:57:46 In a highly available and scalable manner, even though it's a managed offering and not necessarily an open-source offering, its control plane has been built on years and years of research into high-scale engineering. And if you look at other examples, I mean, even at DagKnows, we don't operate at Dataflow scale; our control plane currently takes a more hybrid approach. We are scaling towards offering control planes for our customers on their premises, which allows us to dial how much metrics we can get from the customers to help them, at their own behest. And we are obviously growing and learning and applying better ideas as we improve. So again, I guess time will tell how big and scalable it grows.

Brijesh Ammanath 00:58:38 I think that was quite insightful, Sri. As we wrap up, was there anything that we missed that you would like to mention?

Sriram Panyam 00:58:45 Yeah, there's a lot of influence and impact from building SaaS products on how one would structure engineering teams. Now, building a consumer platform or consumer offering is very involved and complicated, and I think there are certain similarities and differences. In both, technology is fast-paced; things are moving, obviously, with AI, and there's a lot one can do in terms of building services fast. Some of the differences: in a more consumer environment, you have deeper specialization of skills. You would find that engineering teams are often specialized around certain areas, mainly in product engineering teams. Whereas in SaaS offerings, you might need teams that have more expertise in certain domains. You might want to have teams that are very focused on cloud computing or cloud engineering, or security and compliance.

Sriram Panyam 00:59:45 And these come together pulling the functional expertise in building SaaS offerings. There are challenges because doing experimentation is a bit more unified for a product, for consumer product. Because you’re looking at how you would take feedback from customer experience in a fairly homogenous way, whereas how your different customers, your enterprise customers use your product. There’s a bit more variation in SaaS offerings. Again, if you look at SaaS offerings, there’s more emphasis on enterprise features like management consoles, billing features, how you do isolation, compliance requirements. Those are a bit more pronounced in SaaS offerings, which may be hidden away from engineering teams, or they are more localized in expertise in purely product engineering teams. And also this is changing these days. The user experience requirements also change a fair bit. And again your SaaS offerings, depending on the kind of product may be more engineering led especially if the SaaS offering is a lot more engineering focused as opposed to dedicated product management needs on a more consumer product. Yeah. And there’s a lot more. But these are the main ones that come to mind.

Brijesh Ammanath 01:01:08 Thank you Sri for coming on the show. It’s been a real pleasure. This is Brijesh Ammanath, for Software Engineering Radio. Thank you for listening.

[End of Audio]
