SE Radio 544: Ganesh Datta on DevOps vs Site Reliability Engineering

Ganesh Datta, CTO and cofounder of Cortex, joins SE Radio’s Priyanka Raghavan to discuss site reliability engineering (SRE) vs DevOps. They examine the similarities and differences and how to use the two approaches together to build better software platforms. The show starts with a review of basic terms; definitions of roles, similarities and differences; and skillsets for each role, including which is technically more demanding. They discuss tooling and metrics that SRE and DevOps teams focus on, including whether custom automation scripts are more a DevOps or an SRE stronghold. The episode concludes with a look at typical good and bad days for DevOps and SRE and touches on career progression for each role.

Show Notes

Related SE Radio Episodes

SE Radio 276: Björn Rabenstein on Site Reliability Engineering
SE Radio 513: Gil Hoffer on Applying DevOps Practices to Managing Business Applications
SE Radio 457: Jeffery D Smith on DevOps Anti Patterns
SE Radio 313: Conor Delanbanque on Hiring and Retaining DevOps
SE Radio 268: Kief Morris on Infrastructure as Code
SE Radio 288: Francois Raynaud on DevSecOps

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Priyanka Raghavan 00:00:16 Welcome to Software Engineering Radio, and this is Priyanka Raghavan. In this episode, we’re going to be discussing the topic DevOps versus SRE: the differences, the similarities, and how they can work together to build successful platforms. Our guest today is Ganesh Datta, who is the CTO and co-founder of Cortex. Ganesh has an active interest in the areas of SRE and DevOps, primarily from spending many years working with both SRE and DevOps teams, and he is now a co-founder of a company that develops a platform for the latter. I also saw that Ganesh contributes a lot to DevOps.com, where he’s written on topics such as metrics, reviews of open-source libraries, and testing strategies. So, welcome to the show, Ganesh.

Ganesh Datta 00:01:03 Thanks so much for having me.

Priyanka Raghavan 00:01:05 At SE Radio, we’ve actually done quite a lot of shows on DevOps and SRE. We’ve done, for example, episode 276 on Site Reliability Engineering and episode 513 on DevOps Practices to Manage Business Applications. We also did episode 457 on DevOps Anti-Patterns, and then there was also episode 482 on Infrastructure as Code. So, a ton of stuff, but we never looked at, say, the differences between DevOps and SRE, and I thought this would be a perfect show to do. So, that’s why we’re having you here. But before we jump into that, I’m going to dial it back and ask if you could explain, in your own words, what you think DevOps is for our listeners.

Ganesh Datta 00:01:47 When I think about DevOps, there’s obviously a lot of confusion between DevOps and SRE, and there are people that kind of do a little bit of both. So it’s definitely a very open term, and I think the one thing that we always like to say is, you don’t necessarily need to shoehorn yourself into one or the other. There are a lot of people that overlap. But when I think about DevOps, it’s literally in the name, right? It’s developer operations. It’s everything around how do we increase engineering efficiency and engineering productivity, how do we enable developers to operate and work their best? And that comes down to everything from tooling to pipelines to build systems to deployment systems. All that kind of stuff, I think, is really owned by the DevOps team. And so, anything to do with a development team operating their services, that is exactly what DevOps falls under, right?

Priyanka Raghavan 00:02:32 And so how about SRE then? What could you say about site reliability engineering?

Ganesh Datta 00:02:37 Yeah, I think it’s interesting because when you think about SRE, they sometimes do a lot of things that you would think DevOps does, around pipelines and things like that. But when I think about SRE, it’s more from the lens of reliability. They’re thinking about whether the processes that we have in place are leading to better outcomes when it comes to reliability and uptime and those kinds of business metrics. And so SRE is mostly focused on defining and enforcing standards for reliability and building the tooling to make it easier for engineers to adopt those practices. And I think that’s where some of the overlap comes in. We’ll talk about that later, obviously. But anything that comes from a reliability or post-production lens, I think, falls under the SRE umbrella.

Priyanka Raghavan 00:03:15 So, there are also a couple of videos and maybe articles I’ve come across where they typically define it as class SRE implements DevOps. That’s one thing that I’ve seen. What’s your take on that?

Ganesh Datta 00:03:28 That’s a really interesting way of putting it. I think it’s true to some extent. When I think about Ops, you can break it down into pre-production, production, and post-production. Those three are all totally fair parts of the system, and I think SRE generally lives in that post-prod environment where they’re defining those standards. Obviously, those are things you have to build into your systems beforehand. But mostly they’re thinking about, hey, once things are live, once things are out, do we have visibility? Are we doing the right things? I like to think most SRE teams live in that world, so it’s kind of SRE implements post-prod ops, which implements DevOps. So, maybe it’s another level down the tree, whereas in reality it should be SRE implements DevOps, because you should be a) working together and b) working across the stack. So, yeah, I really like that way of putting it.
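The phrase "class SRE implements DevOps" borrows interface syntax from programming languages. A minimal sketch of what it means, written in Python with an abstract base class standing in for an interface, is shown below. The method names and return strings are hypothetical illustrations chosen for this note, not something either speaker specified.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """Hypothetical interface: a set of practices a team commits to."""

    @abstractmethod
    def reduce_silos(self) -> str:
        """How the team works across development and operations."""

    @abstractmethod
    def measure_everything(self) -> str:
        """How the team makes the system's behavior visible."""


class SRE(DevOps):
    """One concrete way to fulfill the interface, through a reliability lens."""

    def reduce_silos(self) -> str:
        return "share production ownership with development teams"

    def measure_everything(self) -> str:
        return "track SLIs against SLOs"


team = SRE()
print(team.measure_everything())  # prints "track SLIs against SLOs"
```

The point of the analogy is that the interface says what should happen, while the implementing class decides how. Ganesh's variant would place a "post-prod ops" layer between the two.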

Priyanka Raghavan 00:04:16 So, the other question I’ve been meaning to ask is this: there’s a lot of confusion in the roles, and you’ve kind of broken it down for us here, but there are also these other new roles that I keep seeing in many companies. For example, infrastructure engineer or cloud engineer. Are these also different names for the same thing?

Ganesh Datta 00:04:35 I think it’s another one of those cases where there’s still a lot of overlap. So, when I think about Cloud engineering, it’s almost like pre-DevOps. If DevOps is kind of focused on, hey, how do we enable teams to build their code, run their code, get it into our Cloud, deploy it, monitor it, things like that, then Cloud engineering is one step behind that. It’s, what is our Cloud? Where are we building it? What does it look like? How do we track it? Are we using infrastructure as code? It’s setting the true foundations of everything and building that bare-bones stack, and then everything else builds on top of that. So, I think that’s where Cloud engineering generally ends. And I think Cloud engineering probably has more of that pre-prod overlap with DevOps, and SRE has the post-prod overlap with DevOps, so they’re kind of living in similar worlds. But yeah, Cloud engineering in my mind is more truly building that foundation and then enabling DevOps to do their job, which is then enabling developers to do their job.

Priyanka Raghavan 00:05:31 And where do you think these things differ? So, is it just on the environment or anything else?

Ganesh Datta 00:05:37 Yeah, I think it comes down to the outcome. So, when you think about building these teams internally, I think you have to take a step back and say, what exactly are we trying to solve? What is the desired outcome? If your situation is, hey, our developers are not setting up monitoring correctly, maybe their pipeline doesn’t have enough automation for setting up that kind of stuff, we have uptime problems, okay, you’re thinking about reliability and you need an SRE team, right? Even if there might be some overlap with what the DevOps team is doing, if your desired outcome is reliability, that’s probably going to be your first step. If your problem is, hey, we’ve got stuff all over GCP, we have things on App Engine, we’ve got things on Kubernetes, we’ve got RDS, we’ve got people running things in Kubernetes, okay, you’ve got to take a step back and say, we have a weak foundation and we need to build that foundation first. Then you’re probably going to look at Cloud engineering. And then you say, okay, we’ve kind of invested in our Cloud, we have some idea of how we’re doing it. It’s just really hard to get there. We have Kubernetes, that’s our future. But for a developer to build our deployment, get it into Kubernetes, and monitor it, that’s going to be really hard. Okay, you’re probably thinking about DevOps. So, I think taking a step back and thinking about what the end goal is will answer the question of what you need today.

Priyanka Raghavan 00:06:48 Yeah, I think that makes a lot of sense. So, I think sort of understanding your outcome defines your role is what we get from this.

Ganesh Datta 00:06:56 Exactly, and I think that’s where a lot of teams struggle: they don’t have those clear charters. The more clearly you can define the charter and say this is what success looks like for a team, the better those teams can work. Because yeah, DevOps is a very broad space. SRE is very, very broad. And so even within that, I think you have to give people that charter and say this is exactly what we care about. Is it that we want more visibility? We don’t necessarily have uptime issues, but we don’t know if we have uptime issues. Okay, then your charter is going to be a bit different. It’s enabling monitoring and observability versus, hey, let’s put together SLOs and create that culture of monitoring excellence. So, even within that there are different charters, and you have to be very intentional about what that charter is.

Priyanka Raghavan 00:07:34 So in your experience, what do you think about the team sizes then? Would that again depend on your charter? Would it go back to that and then you decide?

Ganesh Datta 00:07:44 Yeah, I think it really depends on the charter. You probably want to start with smaller teams to begin with. You don’t want to just bring on a team of 10 SREs and then say, okay, you guys are just going to go do everything, because that a) causes thrash for the SRE team and b) causes thrash for the development teams, because they’re saying, hey, everyone’s asking something different of me. I have no idea what I’m doing. So, be very intentional about what your charter is, and that kind of dictates your team. And obviously that charter might change over time, right? If you start today with, hey, uptime is what we really care about, we have problems with reliability, okay, you have a small team, your standard three to six people maybe, focused on that. And then if you have some other issues around observability and monitoring, maybe that team splits in half and focuses in on it.

Ganesh Datta 00:08:25 And then you can start growing that team and have a team dedicated to observability and monitoring. You kind of see this in organizations that have been doing SRE for a while. You look at startups that have maybe a hundred to 300 people on the engineering team, and you see one dedicated SRE team that just kind of does everything. But you look at companies that have more established SRE foundations, and you see a head of reliability, a head of observability, and even within that you have people running those individual charters. So, obviously teams are not going to get there immediately. Don’t try to do everything all at once and build out too many teams. Start small, figure out where your weaknesses are, and hire around that.

Priyanka Raghavan 00:09:01 I think that perfectly explains what we see. So, if you’re more mature as an organization, you could probably spend more time on reliability and things like that. Whereas if you’re really just starting up, then maybe your foundation is not good enough to even know what you need to be looking at. I think that makes a good segue into our next section, where I wanted to mainly talk about the tooling, the metrics, and maybe the role challenges. So, let’s jump in. The DevOps role, like you said, is something that comes earlier in the development life cycle. So, can you talk a little bit about the tooling? You have build pipeline automation, you have the CI/CD tooling, so what is all that? How does that play with these DevOps principles?

Ganesh Datta 00:09:45 Yeah, absolutely. I think one of the principles that is common across everything is the whole idea of don’t repeat yourself, basic software engineering practices, and not so much even for the DevOps team’s own code, but more from an engineering standpoint. So, thinking about tooling, I think obviously it starts with your source control, right? Every team has to make a decision on that. If you’re hiring a DevOps team, you’re probably far enough along that you’ve tied yourself to some version control system or another. But I think that’s where it really starts, right? So, what is the basic set of practices that we want to enforce across our version control? Do we want pull request approvals enabled for everything? Do we want protected master branches? Things like that.

Ganesh Datta 00:10:25 And maybe you’re not going to define this upfront, but you might set it as a long-term goal. Say, if we do everything correctly, we can get to this place where people are shipping faster, they’re merging things, approvals are happening, whatever. So, I can set that goal. So, it starts with version control. And then once you have that version control stuff set up, it comes down to dependency management systems. So, are you using an internal Artifactory? Are you using GitHub Packages? Are you using any of those, or do you not really ship any libraries internally? What is your artifact store internally? So, kind of starting with that immediate stuff. And then you’re going to think about not just dependency management systems, but the actual build pipelines, things like Jenkins, GitHub Actions, CircleCI. What are the requirements there?

Ganesh Datta 00:11:05 And so this is an interesting part, because I think the DevOps team almost not just thinks about tooling, but needs to be kind of product managers in some sense, where they’re thinking about, hey, what are the things we need in order to support the rest of our organization, right? It’s, do you have the capacity to build parallelization and caching and all this stuff yourself into your build pipelines? If not, okay, maybe you’re not going to go with something as bare bones as Jenkins and you want to buy something off the shelf, right? So, figuring out what is our use case? What kind of tools are we building? Are we building lots of really heavy Docker containers? Are we just building small JavaScript projects? What is the standard thing you’re doing?

Ganesh Datta 00:11:42 Because now you’ve got your build pipeline set up in place, and your build pipeline is obviously going to do a bunch of stuff, right? You’re going to run tests, and you’re ideally going to take that test coverage and ship it off somewhere so you can track it. So, you’re probably going to own a SonarQube instance or something similar to that. You’re also going to have whatever your Cloud engineering team has built, if they exist, whatever that pipeline is to get things into that system. And so, thinking about that infrastructure there, and thinking about alerting and incident management. So, if builds are failing, is that something that’s alertable? Are you going to be integrating with your incident management tools, sending that information in there?

Ganesh Datta 00:12:20 Are you going to be integrating with Slack or Teams or whatever to send information to developers about those builds? All these kinds of things that I think are part of that process are not necessarily owned by DevOps, but they’re something DevOps needs to have a lot of say in, and say, hey, here’s how we’re going to be consuming a lot of those things. And this is where we’re inching into more of the observability and monitoring space: obviously you’re observing and monitoring your actual build system and pipelines, all the tools that you run, but also things like build flakiness and those kinds of metrics that you want to be tracking and giving visibility into. And so, you have your own things that you’re going to be trying to get into the monitoring world. I think this is the general stack that most DevOps teams are working with.

Ganesh Datta 00:12:58 And so, going back to what I was talking about, don’t repeat yourself. I think as a DevOps team is looking at this entire stack, they should be thinking about, hey, how do we abstract away a lot of our stack and make it easy for developers to consume it, right? So, maybe you’re not opinionated on when things send Slack messages, but you want to make it easy for teams to say, okay, if I want to send a Slack message from my pipeline, here’s how I do it. And so, can you give them the tools to do those things in a way that a) makes it easy for developers, but b) follows your own practices, so you are not now maintaining 15 versions of a Slack messaging system, right? You want to keep your own life easier. So, I think DevOps teams, as part of their stack, should be thinking about design principles and things like that as well, because it’s going to make their life hell in the future if they don’t do that from day one.
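As a rough illustration of that "one shared way to send a Slack message" idea, here is a minimal Python sketch of a helper a DevOps team might publish for every pipeline to call. The function names, the message format, and the example URLs are all hypothetical. The only outside fact assumed is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request
from typing import Callable, Dict

Payload = Dict[str, str]


def build_payload(service: str, status: str, build_url: str) -> Payload:
    """Format every build notification the same way (hypothetical format)."""
    return {"text": f"[{service}] build {status}: {build_url}"}


def post_json(webhook_url: str, payload: Payload) -> None:
    """POST the payload as JSON to an incoming-webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10):
        pass  # the response body is not needed


def notify_build(
    service: str,
    status: str,
    build_url: str,
    webhook_url: str,
    send: Callable[[str, Payload], None] = post_json,
) -> Payload:
    """Single entry point that pipelines call instead of rolling their own."""
    payload = build_payload(service, status, build_url)
    send(webhook_url, payload)
    return payload


# Dry run with a stand-in sender, so nothing leaves the machine.
notify_build(
    "checkout",
    "failed",
    "https://ci.example.invalid/builds/42",
    "https://hooks.example.invalid/placeholder",
    send=lambda url, payload: print(payload["text"]),
)
```

Teams stay free to decide when to notify, while the formatting and transport live in one place that the DevOps team maintains.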

Priyanka Raghavan 00:13:42 Yeah, that really rings very close to my heart because I see that, like you say, most DevOps teams come in with the tooling as a religion and then it just gets outdated or you don’t have budgets for that and you have to move to something else and then the reason why you’re doing it is completely lost. So yeah, I think stepping back and having abstraction is a great piece of advice.

Ganesh Datta 00:14:05 Yeah, I think what makes great DevOps engineers, SREs, and Cloud engineers is almost having that product hat. I know all of these roles are highly technical, and that’s why, in the really high-functioning DevOps teams and SRE teams I’ve seen, sometimes they even have a product manager embedded in the team who is extremely technical. Because your customer is the internal development team, right? That is who your customer is. We can talk about SRE’s customers, which differ slightly, but for the DevOps team, their customer is the development team. And so, if you have a customer, then you should be thinking about how do I enable them to do their job? That is your charter at the end of the day, right? So, really taking a step back and saying, how do I enable those teams to do their best? I think having that lens, having that product hat on, helps DevOps engineers perform a lot better. And I think it gives you visibility into, hey, here are the things I should be working on. So, you’re not going off and building things and wasting your own time. It helps you prioritize: these are the highest-impact things that I could be doing. And so, I think that product hat is super, super important.

Priyanka Raghavan 00:15:06 That’s very interesting, because that was one thing I had not really thought about. So yeah, that’s good to know. So, apart from your traditional DevOps tooling skills, having the ability to step back, abstract, and look at things at a slightly higher level will make you successful at your job?

Ganesh Datta 00:15:23 Exactly.

Priyanka Raghavan 00:15:25 Okay. I wanted to now switch gears to SRE. From the Site Reliability Engineering book from Google, I remember this analogy, which of course, as a mother, made a lot of sense to me. I just want to talk about that. The analogy is between software engineering and labor and children. It says the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort. And so I just wanted to talk a little bit about that quote, which is so true in real life, but also in software engineering. How do you think that comes into this SRE role? Do you agree with that?

Ganesh Datta 00:16:05 Yeah, I definitely think so. That’s a really funny way of putting it, but I think it’s totally true. When I think about the work that goes in before production, before things are out, and this is kind of a broader note on SRE generally, I think the thing that’s really hard about SRE is that it’s very much an influence role, right? You’re not just building things, but you need to get people to care about it. You need to get people to do things. It’s an extremely difficult role for that particular reason, not even necessarily the technical side of things, which is challenging enough, and especially because SRE teams in most organizations are operating at a 1-to-30 to 1-to-50 ratio of SREs to regular product engineers.

Ganesh Datta 00:16:43 And so they’re trying to influence all these people to do things, and I think that’s where a lot of the hard work really comes in. So, thinking about the first part, what is that initial upfront labor? It’s, okay, figuring out, based on our charter again, what are the things that we don’t have that we need in order to get to a world where we can accomplish our charter, right? It’s not even how do we accomplish our charter, but how do we get to a place where we could reasonably figure out how to accomplish our charter? And so that’s where you’re setting up your monitoring and observability stack, you’re doing things like setting standards for tracing, for logging, for metrics. Everything kind of has to be standardized. You want people to be doing things in similar ways.

Ganesh Datta 00:17:17 That way things are flowing into the right systems and you have reporting built on top of that. And once you have all this stuff defined, then you’re running after people and saying, hey, you’re still running our old tracing system, can you please add the span ID to your traces? Can you do X, Y, and Z? You’re trying to push other people to do this. And I think that’s where a lot of that pain comes from for SREs: SRE is given this charter of, hey, can you make our company more reliable, right? And that’s fallen on the SRE team, but it’s not really a charter for the rest of the organization, right? And so, SRE is trying to take their charter and make everyone else do it, because that’s kind of what the role is.

Ganesh Datta 00:17:52 And so that’s where a lot of that initial upfront effort goes: getting people to care about those things and driving that visibility. Because once you have that, then it’s a matter of, okay, we’ve got this foundation, and now we’re seeing what the problems are in order to get to that final charter. And then it’s the same thing all over again. Now it’s kind of whack-a-mole, right? It’s kind of the raising-a-child analogy. It’s, okay, it’s there, we got everything, but now it needs so much more nurturing to get to our final state. And so it’s, okay, we’re going to start small, everyone needs to set up monitors. Okay, now we have monitors. Now you’re going to set up an alert, you’re going to set up on-call, you’re going to connect your monitors to your rotation, you’re going to make sure you have contacts, and so on and so forth. You need that foundation and you really have to push the organization to get there, and then you can start nurturing the organization to get to that final state. So, that’s how I think about those two sides of the equation.

Priyanka Raghavan 00:18:39 Yeah, when you talked about logging and tracing, I think that is an art. Or maybe it’s a science, sorry, I should say that. I think that could be a book in itself, or maybe?

Ganesh Datta 00:18:51 A 100% podcast.

Priyanka Raghavan 00:18:53 In itself, but yeah, that’s very true. But, switching into that, I think if I specifically come into the metrics angle. So, what would be the metrics that say the DevOps teams look at versus SRE? If you could just again break it down for us.

Ganesh Datta 00:19:08 Yeah, absolutely. So, when I think about DevOps teams, you’re thinking about developer productivity, things like that. And so, your metrics are going to be more around the actual operational side of things, the developer operations side of things. So, things like build flakiness. Are there issues with the build system or with specific repositories or services that are causing a lot of build failures? How do we prevent that? How do we detect that kind of stuff? Because that is where a lot of time goes. So, taking a step back, when you think about DevOps, it’s how much time are developers spending actually writing code versus how much time are they spending dealing with tooling, right? And the more you can reduce the dealing-with-tooling side of things, the better. Time to production is another great one.

Ganesh Datta 00:19:51 And this is where the collaboration between DevOps and Cloud engineering really comes into play: time to production. Is it easy for DevOps teams to get things into their Cloud platform? And is it easy for developers to traverse their systems into that? So, time to code, time to production, or time to whatever X environment. Things like basic build times: are there bottlenecks on the build systems? I think those are the kinds of metrics that DevOps teams are obviously looking at. I mean, they have monitoring-type metrics as well. If your Jenkins goes down, then obviously you have a problem. So, you’re looking at similar metrics and logs and things like that from your systems, but the things that you own are more of these operational metrics that tell you, hey, are we accomplishing our charter?
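To make metrics like build flakiness and build time concrete, here is a small Python sketch that computes both from a list of build records. The `Build` record and its fields are hypothetical, and the definition of "flaky" used here (the same commit both failed and passed) is one common working definition assumed for illustration, not one given in the conversation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Build:
    commit: str        # commit the build ran against
    passed: bool       # final result of this build attempt
    duration_s: float  # wall-clock build time in seconds


def flakiness_rate(builds: Iterable[Build]) -> float:
    """Share of commits whose builds both failed and passed (assumed definition)."""
    outcomes: Dict[str, Set[bool]] = defaultdict(set)
    for build in builds:
        outcomes[build.commit].add(build.passed)
    if not outcomes:
        return 0.0
    flaky = sum(1 for results in outcomes.values() if results == {True, False})
    return flaky / len(outcomes)


def median_build_time(builds: List[Build]) -> float:
    """Median build duration in seconds, a simple bottleneck indicator."""
    return median(build.duration_s for build in builds)


history = [
    Build("a1", False, 300.0),  # failed, then passed on retry: flaky
    Build("a1", True, 310.0),
    Build("b2", True, 290.0),
    Build("c3", False, 120.0),  # a plain failure, not counted as flaky
    Build("d4", True, 305.0),
]
print(flakiness_rate(history))     # prints 0.25 (1 flaky commit out of 4)
print(median_build_time(history))  # prints 300.0
```

Time to production could be tracked the same way, given timestamps for the merge and the deploy of each change.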

Ganesh Datta 00:20:37 And so I think it’s interesting in that DevOps kind of owns certain sets of metrics. SRE, on the other side, doesn’t own a metric in the same way, right? They can’t impact their own metrics. If SRE is looking at uptime as their final goal, or their SLOs and what they’re breaching at the end of the day, they can only tell developers, hey, your service is breaching a threshold and we’re going to page you, or whatever. But an SRE team can’t do anything about it. DevOps, on the other hand, owns their own metrics. They have these kinds of things that they are going to push forward. And I think that’s one of the slight differences between the DevOps and the SRE side.

Priyanka Raghavan 00:21:10 Okay, interesting. So, the metrics can actually help DevOps teams get better, whereas SREs, even if they look at the metrics, are dependent on somebody else to fix it.

Ganesh Datta 00:21:19 Exactly. I think that’s where the pain comes in on the SRE side, where, again, it’s an influence job. You can only tell people, hey, something is wrong with your service and here’s what we’re seeing. But you can’t do anything about it. For DevOps, again, it’s that product lens, right? You have not just technical metrics but business metrics, these kinds of KPIs, right? That’s the interesting thing. You might have a whole bunch of SLIs underneath that, but you’re tracking against business metrics. You’re not just looking at uptime or other, more technical things.

Priyanka Raghavan 00:21:48 So, I’ll ask you to also explain SLO and SLI again for us, just to make sure everybody’s on the same page.

Ganesh Datta 00:21:56 Yeah, absolutely. So, when you think about SLOs, SLOs are your actual objective, right? It’s, hey, we are trying to get to 99% uptime or whatever, things like that. So, that is your final objective. The SLI is an indicator that tells you, am I meeting my objective? That’s the simplest way to describe it: the SLO is literally what we are trying to accomplish, and the SLI is the indicator that tells us if we are doing that. So, your uptime metric could be your SLI, and your SLO is the target. So, I have a 99% uptime SLO. The SLI is the uptime indicator: what is our current uptime? What is it looking like over time? So that’s how I think about SLO and SLI.

Ganesh Datta 00:22:37 And then you have SLAs, which are more of the actual agreements or promises. So, you might have a six nines, or let’s say you have a three nines SLA. So, you’ve committed to a customer that you have a three nines SLA for uptime. Your SLO might be four nines, because that’s your objective. If you meet that internally, you’re tracking correctly against your agreement, your legally binding agreement with the customer. And your SLI is going to be the actual indicator that says, how are we doing against our uptime? What is our current uptime? So that’s kind of telling us where we’re going.
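The relationship between the three terms can be sketched in a few lines of Python. The three nines SLA and four nines SLO come from the example above. Measuring the SLI as the fraction of successful requests is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import Dict

SLA_TARGET = 0.999   # three nines: the promise made to the customer
SLO_TARGET = 0.9999  # four nines: the stricter internal objective


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured indicator, here the fraction of successful requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def evaluate(sli: float) -> Dict[str, bool]:
    """Compare the measured SLI against the internal SLO and the external SLA."""
    return {"meets_slo": sli >= SLO_TARGET, "meets_sla": sli >= SLA_TARGET}


# 200 failed requests out of one million gives an SLI of 0.9998.
sli = availability_sli(999_800, 1_000_000)
print(evaluate(sli))  # prints {'meets_slo': False, 'meets_sla': True}
```

In this example the internal objective is already missed while the customer-facing agreement still holds, which is the early warning that a stricter SLO is meant to provide.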

Priyanka Raghavan 00:23:09 So, we have the service level agreements for SRE with the customer, who is your end user. Do we have something similar for DevOps? The end user there is the developers. Can the developers say, this is the agreement I want? Or is that more of a collaborative effort?

Ganesh Datta 00:23:24 Yeah, that’s a great question. I think the best engineering organizations view those internal relationships as extremely collaborative. And I think there needs to be collaboration between all of those teams. This is kind of a whole topic of its own, because I think what engineering organizations should not do is create silos between SRE and DevOps and development. Those teams should all work hand in hand, right? Your DevOps team is putting on their product hat, and they’re talking to developers and saying, hey, what are the areas of friction? How do we make it easier for you to build things and just focus on that value, right? And your SRE team is thinking about, how do we get people to do their monitors and their dashboarding and all this stuff?

Ganesh Datta 00:24:04 But when you think about those two, why is SRE kind of pigeonholed into post-production? In theory, those things could be automated for you as well, right? If you are following a standard framework and you generate new projects out of that framework, and you have a standard logging system and a standard metrics system, then in theory your initial framework and your initial build could generate all the same things that your SRE team cares about. So your SRE team and your DevOps team should work together and say, hey, I’m the SRE team, these are the things that we need our developers to be doing before they go into production. How much of that can we automate for developers as part of their pre-prod systems, right? Are there things that the build pipeline could be doing, such as tagging your images with certain SHAs or whatever, so that that flows into our monitoring?

Ganesh Datta 00:24:48 Are there things we can build into their software templates that are going to do logging the right way? And so SRE and DevOps should be working together to say, hey DevOps, can you guys help us do our jobs better from day one so we’re not scrambling afterwards, right? And the same thing between the Cloud platform and the DevOps teams: the DevOps team is saying, hey, here’s what our current status quo is. This is what we need from you in order to do our jobs better. So, how do we figure out how we’re structuring our platforms so that’s going to be a lot easier, things like that. And so, I think all of those teams especially should be collaborating with each other and that’s going to make the developer’s life a lot easier. So, imagine the dream world where a developer comes in, they don’t necessarily know what all the underlying infrastructure is, right?

Ganesh Datta 00:25:30 It’s maybe on Kubernetes, it doesn’t really matter. I come in, I have a set of software templates, I say okay, I want to create a Spring Boot service. And I go into whatever our internal portal is, I select a Spring Boot template, boom, it creates a repository for me with the same settings that DevOps recommends, it generates the code. That code is already preconfigured with the right logging structure, it’s configured with the right monitors that are going to get set up, it’s configured with the right build pipeline that integrates with what DevOps already set up. It’s integrated with SonarQube and the metrics are already going there. Boom, I write my code, I merge it to master, the deploy pipeline picks it up, it goes into our infrastructure, and metrics are starting to flow into whatever monitoring tool you’re using. You’ve got your metrics set in place. As a developer, all I did was I just followed this template and I did a couple things and everything just magically works. And that’s the dreamland that we can get to. And the only way you can get there is if all of those teams are collaborating with each other really, really closely and all of them are kind of wearing their product hats and thinking, this is not just a technical problem, it’s about how do we as an engineering organization deliver faster for our end customers. And so, I think that’s kind of what engineering organizations should be striving for.
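
To make the "software template" idea above concrete, here is a minimal Python sketch, under stated assumptions. Everything in it is hypothetical and not from the episode: the `ServiceScaffold` class, the `scaffold` function, the file names, and the default values are invented for illustration. It only shows the shape of the idea: the DevOps and SRE teams agree on defaults once, and every new service is generated with logging, monitors, and a build pipeline already in place.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceScaffold:
    """What a hypothetical internal portal hands back for a new service."""
    name: str
    framework: str
    files: dict[str, str] = field(default_factory=dict)


# Hypothetical org-wide defaults, agreed on by the DevOps and SRE teams.
# The file names and settings are assumptions made for this example.
DEFAULTS = {
    "logging.yaml": "format: json\nlevel: INFO\nservice: {name}\n",
    "monitors.yaml": "monitors:\n  - name: {name}-error-rate\n    threshold: 0.01\n",
    "pipeline.yaml": "stages: [build, test, scan, deploy]\nimage_tag: git-sha\n",
}


def scaffold(name: str, framework: str) -> ServiceScaffold:
    """Render every default file with the new service's name filled in."""
    files = {path: body.format(name=name) for path, body in DEFAULTS.items()}
    return ServiceScaffold(name=name, framework=framework, files=files)


# A developer picks a template and gets the standards without knowing the details.
svc = scaffold("payments", "spring-boot")
for path in sorted(svc.files):
    print(path)
```

The design point is that the developer never writes the logging or monitoring configuration; changing a default in one place changes what every future service starts with.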

Priyanka Raghavan 00:26:36 So actually in a way all of us should be working on that SLA with the end user.

Ganesh Datta 00:26:40 Exactly. Yeah. Everyone should own that just to some extent.

Priyanka Raghavan 00:26:44 That’s great. I wanted to ask you also in terms of roles, when we go back to it, there used to be this role called a system admin. Is that now dead? We don’t see that at all. Right?

Ganesh Datta 00:26:54 Yeah, I think that’s kind of gone by the wayside. And I think you still see it at some organizations where, if you have legacy infrastructure that you need to operate in some ways, then that kind of falls under the Cloud platform teams. And so, I think that’s kind of merged into, depending on where you lived as a system admin, you might go more into the Cloud platform engineering team or you might be more on the DevOps side. I think there’s not really any overlap with the SRE side of things, but if your sysadmin skills were around, yeah, pipelines and build systems and being able to monitor things, that stuff, you might go more into the DevOps side of things. If you’re a heavy Unix person and you’ve got all your commands and you can go figure out networking and those kinds of things, you’re going to be a great fit for Cloud platform engineering. And that’s probably the future there. So, I think sysadmin is kind of a very broad role. It’s, hey, we’ve got these mega machines and we have no idea what the hell those systems are doing and we need somebody that’s a Unix guru to figure it out. But now it’s, okay, we’ve got specialized teams that have those charters so you can kind of figure out what exactly you want to be doing and really focus on that.

Priyanka Raghavan 00:27:59 And would it be that from that similar context, would it be easier if a developer wants to go to a DevOps or an SRE role, would it be a benefit for SRE or say DevOps?

Ganesh Datta 00:28:11 I think it’s interesting again because what we usually see is a lot of developers really care about or specialize in one of those. There’s people that really care about infrastructure. They come into a young organization, things are starting to get a bit hairy and they’re like, hey, I’m going to take a week, I’m going to set up Terraform, I’m going to set up infrastructure as code, I’m going to set up our VPCs, whatever. That’s going to make my life easier, it’s going to make me a lot happier so I’m going to do that infrastructure stuff. Okay, you’re probably going more towards Cloud platform engineering at that point, right? So that’s kind of one set of engineers and then you have another set of engineers that are, oh my god, the build’s taking forever, we’ve got to go in and fix those systems.

Ganesh Datta 00:28:48 Everyone’s doing things differently. I hate our lack of standardization. I want to bring some sort of standards and order to the chaos. That’s probably more this DevOps-y type space. And then there’s some people that really care about monitoring and uptime and standards and tracing and logging and that kind of stuff. They kind of freak out and are like, I have no idea what’s going on in production, I have no visibility. I feel like I can’t sleep at night because I don’t know what’s going to happen. Okay, you’re probably more leaning into that SRE space. So I think what we see is developers usually have one passion area that they really, really like or they spend a lot of time in. And so, I think they kind of naturally have a path to those worlds.

Priyanka Raghavan 00:29:27 What about this ability to, there are certain engineers who come in as DevOps engineers, so they have this ability to write custom scripts and things to do all the automation. So, is that a big skill to have in both these spaces or only, say, DevOps?

Ganesh Datta 00:29:44 Yeah, I would say I think very solid software engineering skills when it comes to coding are probably more required on Cloud platform engineering and DevOps because, yeah, you’re going to be hacking things together. You’ve got a bunch of systems that have got to talk to each other, you’re more active in that space. So, I think generally speaking, you need to be good at coding, not necessarily system design or architecture or things like that, that high-level abstraction. And I think that’s where, when a DevOps or a Cloud platform engineer is coming into a software engineering role, they’re really good at writing code but maybe need to take a step back and think about software design principles. In some cases SRE is kind of the inverse, where you don’t necessarily have to be an amazing coder but you need to be able to think about the systems and how they interact and more of the architecture side of things.

Ganesh Datta 00:30:35 And so I think that’s where their skillset is. And so maybe not so much the minutiae of, hey, how do I get a GitHub Action to talk to our legacy Jenkins build, which is part of our migration and blah blah. That stuff is probably too in the weeds for an SRE team, but they’re thinking more about, hey, how do our systems interact, where are the bottlenecks, the critical areas of risk. And so, there’s definitely some overlapping skillsets, but that’s kind of where I see SRE teams have most of their thinking hats on.

Priyanka Raghavan 00:30:59 Okay, so more of the details on the system interactions and things like that and how your systems talk to each other would be DevOps, and taking a step back and looking at flows to see where bottlenecks are would be SRE.

Ganesh Datta 00:31:12 Exactly. Yeah.

Priyanka Raghavan 00:31:13 Okay. I now want to switch gears a bit into, say, the communication angle. So, one of the things that is interesting from SRE, and I guess it’s also in DevOps, is when an incident occurs, they do this thing called blame-free postmortems. Can you explain that? I believe in the book on SRE, I mean the Site Reliability Engineering book from Google, they talk a lot more about this, but is it a similar concept also for DevOps?

Ganesh Datta 00:31:38 Yeah, I definitely think so. I think if there’s an issue with how somebody has set up their pipelines or they’re not integrating with your tooling the right way or whatever, I think your first question should be, what was the gap, right? Was there a gap in our tooling that said, hey, I need to go off and build my own thing because the current systems that we provided don’t work, right? What is the reason why the developer went off the rails somewhere, went outside of those guard rails to go and do something that the DevOps team hasn’t kind of given their stamp to? That should be our first question. Again, going back to the product hat, right? It’s, don’t blame the user, there might be something wrong, right? Is there something that we should be working on?

Ganesh Datta 00:32:13 That’s kind of step one. Step two is, okay, maybe if there was nothing, then why did they kind of go down that path, right? Was it a lack of evangelism? Did they not know that these systems existed? Do they not fully understand them? Okay, if that’s the case, then maybe there needs to be more education within the organization, right? Taking opportunities for lunch and learns, taking opportunities for internal guides or wikis that talk about this stuff. Maybe there should be automated tooling, and kind of thinking about, what are the process things that went wrong to get here? And so again, it’s not about blaming the folks that did something quote unquote wrong, but understanding how do we make sure that doesn’t happen again? Because sure, you can blame someone all you want, but you’re going to hire somebody else, somebody else is going to do the same thing again and you’re just going to keep blaming everybody.

Ganesh Datta 00:32:55 You need to figure out, hey, how do we as a team just accept that this is going to happen and make sure that we have processes in place to ensure that it doesn’t, how do we make sure that we are able to accomplish our charter outside of what those teams are doing, right? That’s kind of what it comes down to with blame-free postmortems as well. It’s, things are going to happen, incidents will always happen no matter how brilliant a programmer or SRE team you are, something is going to go wrong. And so, when something goes wrong, you want to take a step back and say, okay, something went wrong, doesn’t matter who did it. How do we make sure this doesn’t happen again? That’s always the question: how do we prevent something like this? What were the gaps, right?

Ganesh Datta 00:33:28 We know it’s going to happen and we need to make sure it doesn’t, and so the DevOps team should be thinking about it the same way. It’s, we know it’s going to happen again. How do we make sure it doesn’t? And so, I think taking that lens is super important and I think there’s more of a collaboration element here as well where they need to be working with developers and saying, hey, how do we make sure that doesn’t happen again and what can we be doing in order to better enable you? And so yeah, I think blame-free culture is just important generally. And I think DevOps should be taking that kind of product lens again when they see these kinds of issues: hey, why are people not doing the things that we hope they should be doing?

Priyanka Raghavan 00:34:00 That’s interesting when you talk about the collaboration angle. And so this question might be a little bit long-winded, but one of the things I noticed is whenever we have an incident and you do this root cause analysis, then there is of course analysis done on what really happened, which maybe the SRE team looks at, and then a ticket is created and then that either goes to, say, a DevOps or developer team. And even though we know that there should be a blame-free culture, it almost looks like this work is just given to different teams. And then there’s this problem of, like you said before, operating in silos, right? And so, I almost wonder, do we need to have a kind of a facilitator role as well to have this kind of blame-free postmortem, and how does communication play with all these different roles?

Ganesh Datta 00:34:49 Yeah, I think when it comes to postmortems specifically, in theory the facilitator should be SRE, and it’s kind of a conflict of interest, but that falls under their charter, right? If their goal is to improve uptime or improve reliability, doing good postmortems falls into that world, right? The better you can do your postmortems, the better you can follow those action items that are coming out of them, the better you’re going to be in terms of accomplishing your own charter. So it’s in your best interest to enable other teams to do the things that they need to do in order to accomplish your own charter. Again, kind of going back to the idea that SRE is like an influence organization. And so, when you think about doing a postmortem, you want to be facilitating those conversations and saying, hey, did SRE provide you the tooling to say something went wrong?

Ganesh Datta 00:35:33 Were you able to detect it in time? Were you alerted in time? What are the foundational pieces missing? And if so, we’re going to take those action items back and fix them because that’s our job, right? That’s kind of on our systems. And then facilitating those action items to say, here are the clear outcomes of this postmortem, right? Somebody has to take charge and say, okay, out of this postmortem there are five action items. And I think what happens in a lot of cases is you create these Jira tickets, there’s 15 tickets that come out of a postmortem and there’s no prioritization in place. They’re just there in the void and people either take them or they don’t. And that’s the classic thing that happens with these postmortems, right?

Ganesh Datta 00:36:12 And so I think coming out of a postmortem, the SRE team should be saying, hey, this postmortem is not over until we have an idea of prioritization, right? It’s, which of these things are must-haves? Which of these things are should-haves and which of these things are nice-to-haves? And so, the must-haves are going to be, hey, we are going to bother you incessantly until we know those must-haves are complete. Because those are kind of what you have agreed to. Okay, these are things that have to be fixed now and we’ve kind of all agreed on this within this postmortem. And the should-haves, that’s something you probably want to track somewhere. It’s, hey, are we building up these should-haves? How do we continuously go back to the development teams and say, hey, we need your help to prioritize these things?

Ganesh Datta 00:36:48 And so I think, yeah, the SRE team kind of plays that facilitator role a little bit, but it also comes down to those engineering managers on the development teams as well, right? If you’re an engineering manager, if you’re a product manager, you can’t lose track of the fact that you are working closely with the SRE team, right? You are enabling the SRE team to do their charter, right? If you are just, hey, screw you guys, we’re just going to go off and do our own thing, you’re not creating a good working environment internally. So as an engineering manager or product manager, it is your job to kind of go back and say, hey, how do we as a team help our fellow sibling teams to do their jobs as well? So, we are going to do our best and they’re going to do their best. I think that’s the kind of general engineering culture you want to create. But yeah, the SRE team I think is the facilitator within the postmortem boundary itself.
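
The must-have / should-have / nice-to-have triage described above can be sketched in a few lines of Python. This is an illustrative sketch only: the `ActionItem` class, the `Priority` enum, and the `can_close` and `follow_up_order` functions are hypothetical names invented here, not part of any tool mentioned in the episode. The rule it encodes is the one from the conversation: a postmortem is not over until every action item has a priority, and follow-up starts with the must-haves.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    MUST = 0    # has to be fixed now; SRE follows up until it is done
    SHOULD = 1  # tracked, revisited with the development teams
    NICE = 2    # optional improvement


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: Optional[Priority] = None  # None means "not yet triaged"


def can_close(items: list[ActionItem]) -> bool:
    """The postmortem is only over once every item has a priority."""
    return all(item.priority is not None for item in items)


def follow_up_order(items: list[ActionItem]) -> list[ActionItem]:
    """Return items with must-haves first; refuse to run on untriaged items."""
    if not can_close(items):
        raise ValueError("some action items have no priority yet")
    return sorted(items, key=lambda item: item.priority)


items = [
    ActionItem("Add alert on queue depth", "sre", Priority.SHOULD),
    ActionItem("Fix retry storm in client", "payments", Priority.MUST),
    ActionItem("Tidy up dashboard", "sre"),
]
print(can_close(items))  # the untriaged dashboard item keeps the postmortem open
```

Keeping "untriaged" as an explicit state is the point of the sketch: tickets left without a priority are exactly the ones that end up sitting in the void.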

Priyanka Raghavan 00:37:34 Yeah, that’s interesting because I read this article which said that the SRE practice involves contributions to every level of the organization. I think that probably makes sense because they are then playing that facilitator role, right? Because they’ll talk to, I guess, the product owners, the developers, the engineering managers, and then yeah, I guess the DevOps teams to have this communication. So, would you say that this is another skillset for an SRE, good communication skills?

Ganesh Datta 00:38:02 Absolutely. Yeah, I think it goes back to SRE being an influence role, right? In many cases when an SRE team is formed, it was probably because you are starting to see reliability as a key business driver, right? There’s a reason why you’re investing; nobody’s going to invest in reliability if it doesn’t matter, right? There’s some key business reason why you’re investing in reliability and uptime and things like that. And so usually that team falls under the VP of engineering or the CTO directly; the SRE team kind of directly reports up into the VP of engineering. And so, there’s a clear line of communication there, but then you also have kind of visibility to the rest of the organization and you need to influence the rest of the organization.

Ganesh Datta 00:38:40 And so being able to communicate to leadership where the bottlenecks are and what you need resources and help in kind of driving across the org, as well as communicating directly to engineers and within your own team, I think that’s kind of a unique skillset that SREs need to have. Because in some cases, the SRE team cannot necessarily influence the engineering team directly and they almost need to say, hey, VP, here’s what we need for the organization. We know it’s a broader effort, but here’s why it’s important and we need your help in order to make this a key initiative. And so, it’s kind of a go up to go out type of model. And you see this in a few other functions as well. Security is a great example of this, where security is, okay guys, figure out how you’re going to make our software more secure.

Ganesh Datta 00:39:23 And they’re trying to get developers to do things and they’re trying to communicate up to the CISO or whatever. And it’s kind of a similar thing where it’s a go up to go out type of system. And so, SRE is very similar in that case, where you need to be able to communicate up, you need to be able to communicate out, you need to figure out how you’re going to drive that influence. And so, there’s definitely a lot of communication involved and it’s not the first thing you think about when you think about SRE, but I think that’s where a lot of people who go into SRE kind of have that initial shock: there’s a lot more people stuff going on in this role than you would initially expect. It’s not just a technical role. It’s one of the fun things about the role as well, but it definitely is something that people don’t realize as they go into it.

Priyanka Raghavan 00:39:59 Okay, that’s good to know. And I guess now moving into sort of the last section of this episode, I want to talk a little bit about the day-to-day life of an SRE versus a DevOps engineer as you would see it. So, what would a good day for an SRE look like?

Ganesh Datta 00:40:15 Good day for an SRE, you’re probably writing a doc somewhere on your future state, on what reliability looks like. There’s no incidents. Monitoring and metrics are flowing beautifully. There’s no postmortems, all the action items are empty. There’s nothing in Jira. That’s a beautiful day for an SRE. Now, does that ever happen? Probably not. But a more realistic day I think is a combination of kind of, yeah, goal setting, kind of doing analysis on the metrics that you are accountable for, for uptime, and saying, hey, where are the issues? Are there things that are popping up that we don’t really know about? Who should we be talking to about these things? I think that’s probably part of your day. Another part of your day is probably talking to other engineering teams and talking to them about SLOs and adoption and things like that.

Ganesh Datta 00:40:55 That’s going to be part of your day. Another part is evangelizing things. So, you’re probably defining SRE readiness standards and things like that, and communicating that to the rest of the organization. One thing we didn’t talk about at all is the kind of initial SRE concept of being the initial on-call team as well. So, I think there was a period of time in which SRE was also the first line of defense. They would be on call for things and then they would escalate to engineering teams. What’s interesting is we don’t really see that as often these days. I know Google still kind of does things that way, but it’s more of a you build it, you own it type of model at most organizations now. And so I would say in some organizations an SRE’s day-to-day might be, yeah, fielding the pager or whatever, being on call for things that are not their own things, but things that other people have built.

Ganesh Datta 00:41:37 But yeah, we don’t really see that happening as often these days, especially at companies that are sub-thousand engineers. It’s mostly, yeah, the teams are going to be on-call for the things that they own or maybe there’s a separate support team that’s on-call generally that’s going to be escalating things through the pipe. But yeah, I think generally the day-to-day is a bit of, yeah, your standard observability monitoring, incident management, being part of these ongoing issues, being that sounding board, the postmortem facilitator, the incident facilitator, evangelism, and the kind of goal setting and working with the DevOps and the Cloud engineering team and things like that. So those are kind of the things that we usually see in a general day-to-day.

Priyanka Raghavan 00:42:13 Okay. And I guess, so would I only have a bad day if I was the first line of defense? I mean, I guess you could have a bad day in other things, but would it be more stressful if I was the first line of defense?

Ganesh Datta 00:42:28 Yeah, I think that’s when it would get really bad. But I think you can still have a very bad day if there’s incidents generally across the organization. Because we talked about how the SRE team is kind of the facilitator, so they’re still operating as part of those incidents. They’re being that sounding board, they’re facilitating it, they’re looping in the right people, they’re making sure that their systems are looking good, they’re making sure that the right data is being provided to the teams so they can make clear decisions. They’re providing insight into, yeah, the escalation paths and escalation policies. So, not in all cases, but in many cases they’re kind of running that incident commander type role as well. So, they’re kind of in charge because, yeah, that incident is directly affecting their final metric, which is uptime or reliability or whatever.

Ganesh Datta 00:43:11 And so it’s in their best interest to run that incident as smoothly as possible. And so regardless of whether you’re the first line of defense, where you are triaging and resolving incidents from the get-go, or whether it’s a you build it, you own it type of model, you’re still involved in those incidents and you’re still trying to figure out and help those teams, on top of everything else you’re trying to do. I think that can be a bad day. Another example of a bad day is you’re trying to get people to do things, but you don’t have any say in it. And other teams are saying, hey, we’ve got these deadlines, we’ve got these other things we’re working on. Our manager says we don’t have time for this, and you’re just blocked. You just can’t do anything because you’re blocked on everyone else.

Ganesh Datta 00:43:48 And I think that’s almost the most frustrating thing, where it’s, I am not able to do my job because I’m not getting that buy-in from other organizations. At no fault of their own either, right? They have their own things that they have to be working on, their managers and directors, whatever, telling them this is your priority. Ignore reliability, it doesn’t matter. But no, reliability matters, that’s what matters to us. And so how do you kind of cross those boundaries? And so, I think a really bad day is when that collaboration breaks down, right? And it happens in every organization, and you need to be working on that. I think that can be a very emotionally draining, bad day because you just can’t do what you’re trying to accomplish. So, I think those are super examples of what bad days can be.

Priyanka Raghavan 00:44:25 Okay, great. I think that kind of really drove home the point where, yeah, you could get terribly frustrated if you can’t really do your job because it depends on someone else. Yeah. I think obviously I have to ask you now what a bad day for a DevOps engineer looks like. Is it just that, say, if GitHub is not working or is down, or say Azure DevOps is down or Jenkins is down, is that a bad day?

Ganesh Datta 00:44:50 Yeah, I would say when the actual things that you own are down, that’s kind of a bad day for everyone, and it’s a you build it, you own it type thing again. You own those systems, the systems are down and your developers are, what the hell? I can’t do anything. That’s probably a really bad day for the DevOps teams. But another, lesser-thought-about bad day is when you hear frustrations from developers, kind of just generally. It’s, this is not working for me, this sucks. I’m not able to build, it’s super flaky, whatever. The things that you’re building are not working for teams. And I think that can be really frustrating. Again, in an emotional way, it’s like, hey, whatever we’re trying to do is not working and we’re not able to enable those teams.

Ganesh Datta 00:45:26 And I think again, this is where, for both the SRE and DevOps teams, that product hat comes in. If you’re a product manager for a consumer app and you hear consumers saying, this product sucks, I don’t want to use it, I’m going to churn, whatever, that’s what sucks as the product manager: the decisions that we made clearly are not working or we’re not able to execute on our goals. And I guess in the consumer app people might churn. In this case, obviously, people are not going to churn but they’re going to complain, or you’re going to feel that frustration kind of bubbling up and you may not be able to do anything about that. So, I think that can be a bad day: you’re working on things and it’s not working correctly for teams. You’re not enabling teams the right way and there’s some gap in what you thought was going to be the right path forward. I think those days could be very emotionally taxing and emotionally a bad day for DevOps teams.

Priyanka Raghavan 00:46:10 And to come back on a positive note. And a good day would be when nobody’s complaining?

Ganesh Datta 00:46:15 Yeah, when things are just happening and you see a lot of activity: people are building things, people are deploying things, everything’s just magically happening, new projects are being created and nobody has any questions for you, nobody has any feature requests for you. That means you’ve almost taken yourself out of the equation. It’s, you have built a system in which people can operate without the guidance of DevOps and everything is just working seamlessly. I think that’s a wonderful day. It’s, hey, the stuff we’re building is working and teams are enabled and teams are off just building things and doing things for the business as opposed to grappling with infrastructural things. So, I think that can be a really, really satisfying day for DevOps teams.

Priyanka Raghavan 00:46:48 That’s great. And now that you’ve laid all of this out for us, who do you think gets paid more? Is it an SRE or a DevOps?

Ganesh Datta 00:46:56 I think nowadays it’s starting to kind of get a bit more equal. I think what we see is DevOps teams can be a bit more junior in some cases. So, I think that’s where some of the pay disparity comes from: you can probably get somebody kind of fresh out of college, a new grad who has some coding experience. You can train them to be good DevOps engineers and so you can kind of get away with a few more junior folks, whereas SRE teams are a bit more experienced; they need to understand where bottlenecks can be and best practices and all that stuff. And so, I think that’s why on average you see SRE teams might be being paid more. But I think it’s because DevOps teams in a lot of cases just have slightly more junior folks across the board. But I think, once you’re kind of mid-career on both, you’re probably at the same pay grade.

Priyanka Raghavan 00:47:38 Okay. So that’s interesting because I wanted to ask you about the career progression for SRE versus DevOps. Would I be right in saying then that after a point, maybe there would be a stagnation for a DevOps engineer, or is that not the case?

Ganesh Datta 00:47:52 Yeah, I think it depends on the organization. If DevOps is kind of just working within these pipelines or whatever, there’s not much more you can do. Maybe you can get into management and stuff. And so, I think it really depends on the organization because in some cases there’s paths to, I mean, DevOps could live in the broader developer experience, developer productivity orgs. And so, it’s one piece of that. And so, kind of going up into running or being a part of the broader developer experience team or being kind of in charge of that I think is your career progression, and we’re seeing a lot more developer experience and developer productivity teams coming up in more organizations. So, I think there’s starting to be an even more clear path for DevOps folks.

Ganesh Datta 00:48:32 So I think that’s one career path. But at other organizations sometimes it might be moving more into platform or Cloud engineering, going up the ranks there, or I think maybe SRE. I think that’s where kind of people have a bad taste in their mouth for DevOps and I think that’s why people are trying to rebrand it or rename it into all these other orgs, because in some cases, yeah, DevOps has been stagnant because organizations haven’t really thought about that charter. Why do we have a DevOps team? It’s for developer experience and productivity and efficiency. So why not give DevOps the opportunity to own that entire thing? And so that’s why it’s like, yeah, we’re kind of calling it developer experience and things like that now. And so yeah, I think if you’re at an organization where there’s just DevOps and they don’t own anything else, then yeah, it’s probably going to kind of stagnate. But yeah, if you have the right opportunity and the DevOps team is within the right organization, there’s a really great path there.

Priyanka Raghavan 00:49:21 That’s very interesting. So, everything kind of ties back to the charter. So I think, if your charter is clearer and as you get more mature, then maybe the career progression is also better for the DevOps teams.

Ganesh Datta 00:49:33 Exactly, exactly.

Priyanka Raghavan 00:49:33 That’s great. Ties in very well with how we started. So, I guess the next question would be do you see many other roles that emerge from these roles in the future?

Ganesh Datta 00:49:45 Yeah, I definitely think so. I think from an SRE standpoint you probably see people starting to specialize in individual parts of SRE. So, we’re already starting to see that: people who are really good at monitoring and observability, people who are really good at kind of like standards and governance and compliance and things like that, people that are really good at incident management. So maybe you might have people that kind of specialize in that. And so, as we learn more about these roles, I think we are going to see more specialization around there. And so, I think that’s something that for sure we’ll see. And then I think in terms of the DevOps side of things, you’re probably going to see specialization in specific parts of developer experience, right? So, it’s going to be things like, are you working on internal developer portals? Are you working on observability and metrics for the developer experience side of things, or are you working on pipelines? Are you going to be a product manager within DevOps? Right? I mean, we talked about how it is a product hat, so is that going to be a thing as well? So, I think all of those things are examples of where we might see a lot more specialization and individual roles kind of being carved out of these broader spaces.

Priyanka Raghavan 00:50:46 Okay, so I think you talked about something called developer productivity. Are there organizations which have a team that does that?

Ganesh Datta 00:50:53 Yeah, dev prod, DevEx, I think is what we see a lot of. Because I think they finally realized, hey, this is the charter, right? Our charter is to make developers more productive and enable them to focus on building the stuff that actually matters. And so, I think that’s what we’re starting to see now: okay, if we acknowledge that that’s the charter, let’s call the team that. It’s developer productivity, and all these things kind of fall under developer productivity, and it’s the foundation for just general product development work. So, we’re starting to see more organizations build out the team and again, yeah, this goes back to the charter being a lot more clear.

Priyanka Raghavan 00:51:25 And you also talked about things like observability and roles coming from there. That’s also very interesting. Do you actually see things like that existing today? Do you have an observability team? I’m just curious about that.

Ganesh Datta 00:51:38 Yeah, we see that all the time at large organizations. So not necessarily at Cortex, but we see that at a lot of our customers: they have folks that are specialized in observability and monitoring, because in a large organization you might have many tools that are all generating data and different types of metrics, and you want to report on things, and you want all that stuff to flow into a single place. You want to set standards on how you’re doing monitoring and alerting. There are so many things that fall under that umbrella. It’s like, hey, we’re just going to have a team of people that are full-time thinking about this and doing this, versus trying to have them do 20 different things. Because if your focus is more around the SLOs and the adoption and the best practices and things like that, you’re not going to have time to think about the minutiae and the nitty-gritty of the monitoring stack as a whole. And so, it’s like, we’re going to give that team a charter: anything monitoring related, that’s you guys, go figure that stuff out.

Priyanka Raghavan 00:52:25 So it’s all boiling down to the charter, it all comes down to that. So, I have to ask you, is that a role in itself for the future, writing charters?

Ganesh Datta 00:52:35 I think with a good executive leadership team, that’s what they should be doing. You think about a good VP of engineering or a good CTO coming in and setting that charter. I think truly everything comes down to that. When you hire an SRE team, you need to tell them here is exactly what’s wrong today and here’s the future we want to get to, and give them the autonomy to go and get to that final world, right? And I think that’s my problem with this whole idea of OKRs and key results, right? You’re going to give them, oh, we want these metrics to go up by X percent. Okay, cool, maybe that works at a larger organization, but if you’re building your SRE team from the ground up, it’s more going to be, here’s our final end state, and you as a team figure out how you’re going to get us there and hold yourself accountable to that.

Ganesh Datta 00:53:15 Not having key results doesn’t mean there’s no accountability, but you need to help them define that vision for how they’re going to get there. And so, I think that’s why that charter is so important. Even things like SLOs, right? A lot of organizations will come in like, oh, Google does these SLOs, we’re going to do the same thing. But if you’re a smaller team, maybe your SLOs are not necessarily uptime driven, right? Your SLOs might be, hey, we have a payment system, and our payment fraud rate is X, Y, and Z, and so we want to drive that particular rate down, and that is our business service objective, right? That’s some of the things we want to think about. So, the SRE team should be given that. Again, if the organization has a charter, the SRE team can say, okay, how do we enable teams to get to that state? And so, I think that’s why you see in really high-performing organizations, every team knows why their team is important and what their goal is, and they can just work towards that with autonomy. I think that’s why it’s super important to have the charters, and I think that role really falls at the very top: leadership needs to be setting those goals at a very high level, and then it needs to trickle down as well. So yeah, I think that’s where the charters really start.

Priyanka Raghavan 00:54:15 So I guess if I were to summarize this whole thing, apart from the DevOps versus SRE debate that we started off with, one of the key areas that I’m seeing is that final SLE: everybody should be looking at that. So that’s one angle. Another is having a good charter, and I think this whole communication piece comes from strong leadership. I think that’s one big thing, but how do you also trickle that down to these individual teams who are operating? How do you find that purpose? Would the recommendation then be that you go for customer workshops or something like that, where you see what the end user does, even with people who are really far down in the hierarchy, so that they get a feel that their work is important? In your experience, how do you get that vision driven down to them?

Ganesh Datta 00:55:05 Yeah, I think a lot of it comes down to cross-team communication, and communication upwards as well. And so, as an SRE team, if there’s something that you really want to drive, right, you want to take a step back and say, hey, how does it affect the bottom line? Maybe there’s a quantification element to it. We are seeing X hours being spent on incident resolution, and if we had more visibility or automation around automatic incident resolution, we would save X hours. And so, this is why investing in this infrastructure and this monitoring and tooling is going to be super important. It drives X percent of engineering cost. And so, hey, now your leadership understands why that’s super important and how that gets you to your charter, and they can then communicate that to the rest of the organization. You can say, hey, we’re not just doing things for the sake of doing things, here is the impact, right?

Ganesh Datta 00:55:49 You want to always define that if we do X, here is going to be the future state, right? You can’t just go to other teams and be like, we need you to do X. They’re not going to understand that, right? It all comes down to that collaboration, and this is just basic communication practices as well, right? If you’re an engineer working in a product team, you don’t want your product manager to say here’s a ticket, go implement it, right? It’s here’s what we’re trying to do, here’s how this helps us get to that final state. And then as a developer you feel, hey, I’m part of a bigger thing. I have this impact; I understand why I’m doing the things I’m doing or why this is super important for the broader organization. And I think DevOps and SRE are no different.

Ganesh Datta 00:56:22 You can’t just say here’s what we’re doing, we need everyone to migrate onto CircleCI. Oh my God, I’ve got 15 other tickets I’m working on. You can’t just tell me that. It’s, hey, it’s because we’re seeing a lot of build failures, and we think that these particular features are going to help us get there, and therefore that’s going to help you by reducing your cycle time on PRs. You want to have that communication. And even when we talk about Cortex and developer portals, which is what we do, we tell people to say, hey, if I had a developer portal I could do X. Set that vision and say here’s why we’re doing this. And then you can get people bought in and they say, oh my God, that future end state sounds awesome. How can we help you get there, right? So, the more you can set that final end goal, and a very concrete end goal, the easier it’s going to be for people to feel, hey, I know why I’m doing the stuff I’m doing. It is high impact, it’s meaningful. So, you can’t just give people things to do, but you’ve got to tell them here’s why we’re doing it and here’s the impact that you’re going to have.

Priyanka Raghavan 00:57:15 So, I think, if I were to end it, apart from the charter there’s also data, which, as you said, is that concrete way of looking at it, right? So, have a charter, have concrete data to bind to the charter, and then you can have all the magic, have good communication, and build a successful platform.

Ganesh Datta 00:57:33 Exactly. Yeah.

Priyanka Raghavan 00:57:35 It’s great. It’s been very enlightening for me, Ganesh personally and I hope it is for the listeners of the show as well. And before I let you go, I wanted to find out where can people reach you if they wanted to contact you? Would it be on Twitter or LinkedIn?

Ganesh Datta 00:57:50 Yeah, if you’re interested in hearing more about this stuff, obviously this is what I do for a living: working with all of these teams and helping them accomplish their charters. So, you can just shoot me an email at [email protected] and hopefully I will find it in my box.

Priyanka Raghavan 00:58:03 Okay. We’ll do that. I’ll also add a link to your Twitter and LinkedIn on the show notes apart from the other references. So, thank you for coming on the show.

Ganesh Datta 00:58:12 Thank you so much for having me.

Priyanka Raghavan 00:58:14 Great. This is Priyanka Raghavan for Software Engineering Radio. Thanks for listening.

[End of Audio]
