SE Radio 472: Liran Haimovitch on Handling Customer Issues

Liran Haimovitch, CTO of Rookout, joins host Robert Blumen for a discussion of handling customer issues. The discussion covers types of customer issues; handling of issues by a customer-facing support team; how customer support can resolve issues; when issues cannot be resolved by support; escalation path to engineering; how issues are transferred from direct support to product engineering; support issues as a form of unplanned work; how engineering teams plan for support workload; difficult issues that take a long time to resolve; involving multiple engineering teams in an issue; and prioritizing work on support issues relative to new features and other types of work.

This episode sponsored by NetApp.

Show Notes

Transcript

Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].

SE Radio 00:00:00 This is Software Engineering Radio, the podcast for professional developers, on the [email protected]. SE Radio is brought to you by the IEEE Computer Society and IEEE Software magazine, online at computer.org/software. SE Radio listeners, we want to hear from you. Please visit se-radio.net/survey to share a little information about your professional interests and listening habits. It takes less than two minutes to help us continue to make SE Radio even better. Your responses to the survey are completely confidential. That’s se-radio.net/survey. Thanks for your support of the show. We look forward to hearing from you soon.

Robert Blumen 00:00:51 For Software Engineering Radio, this is Robert Blumen. My guest today is Liran Haimovitch. Liran is the CTO and co-founder of Rookout, a production debugging company. He is active in the field of computer security and an advocate of agile, lean, and DevOps. He is a frequent blogger and speaker. Liran, welcome to Software Engineering Radio.

Liran Haimovitch 00:01:16 Hello, Robert. It’s great being here.

Robert Blumen 00:01:17 Liran and I are going to be talking about customer issues escalated to engineering. Before we talk about escalation, let’s talk about what a customer issue is.

Liran Haimovitch 00:01:31 So a customer issue can be a little bit of everything. It can be an alert going off in a production monitoring system, showing that something might be wrong for a certain customer or a group of customers. It can be a support ticket that the customer opens. It can be a feature request. It’s, in fact, any indication that either the customer is complaining or asking for service, or that something is going wrong. And obviously, the worst of those customer issues are the ones you actually don’t know about: something went wrong, but you don’t have the monitoring feedback loops in place to actually understand that something has gone wrong.

Robert Blumen 00:02:13 What is the business impact of issues on the customer?

Liran Haimovitch 00:02:17 So working with our customers, we’re seeing that customer issues, technical issues, can have a huge business impact for them. And those can be different, whether you’re a SaaS company, an on-prem solution provider, a B2B company, or a B2C company, but regardless of what sector you’re in, we’re seeing that customer issues have a huge impact on the company’s business. In the B2B sector, we usually hear our customers describe those issues as hurting their deal flow, whether it’s a new deal or a new trial that gets delayed or pushed back or even lost to a competitor, or whether it’s a renewal or an upsell that becomes at risk due to customer dissatisfaction. For B2C companies, you often find those issues impacting the funnel and conversion rates at various stages of the funnel, impacting customer satisfaction, NPS scores, referral rates, and so on and so forth. And we’ve seen that almost no matter what your business metrics are, customer issues and poor quality can impact them negatively.

Robert Blumen 00:03:29 It must be very important, then, for customers in any business to be able to rely on the products and services they use. Do organizations typically offer any kind of guarantee about how quickly issues will be addressed?

Liran Haimovitch 00:03:45 So we’re seeing that some companies try to give those guarantees. I would say for the most part, those guarantees are very defensive and lawyer-like. What you tend to commit to is the very minimum, the stuff you would be able to defend in court, rather than the service you actually want to provide to your customers. Every company I know aims to provide the best service possible to its customers, resolving issues as fast as possible and ensuring a smooth and painless experience, while those contractual guarantees tend to be more about: when something is broken, if you call us, we’ll pick up the phone. They are not necessarily about how fast, once we pick up the phone, we are going to help you in any meaningful way.

Robert Blumen 00:04:40 You mentioned one way that you learn about customer issues: your own internal monitoring systems tell you something is down or not working as expected. If you learn about the issue from the customer, how does the customer communicate that to the organization?

Liran Haimovitch 00:04:56 So we’re seeing customers communicate issues to organizations in many different ways. The obvious ones are by contacting support, whether it’s through in-app chats, email addresses, or phone calls. But we’re also seeing customers communicate issues indirectly, through product reviews and application reviews in the stores, and through recommendations, or dis-recommendations, on social networks and blog sites. And we’re even seeing that in their actions. Sometimes you can detect customer issues because your metrics are suffering. And sometimes those metrics going down mean there’s an issue: there’s a bug that’s hurting, preventing customers from converting as you’d prefer, essentially causing an issue.

Robert Blumen 00:05:46 A customer issue, then, is an issue which is affecting the customer, but you’re not necessarily learning about it because the customer tells you. There are a lot of ways you might learn about it. Is that right?

Liran Haimovitch 00:05:59 I would say that, generally speaking, the best way is to learn about the issue before the customer even knows that there is an issue, or at least before they tell you. The second-worst way to learn about an issue is to have the customer tell you. The worst way is to learn about the issue after the customer has told other people about it, hurting your reputation.

Robert Blumen 00:06:19 What you don’t want, then, is for somebody to go on social media and say, I use this product, it’s no good. It’s down all the time. They suck.

Liran Haimovitch 00:06:27 Yeah. That’s the worst possible way. I mean, that’s why you want to catch those issues as fast as possible and resolve them as quickly as possible, potentially even proactively reaching out to your customers where appropriate, letting them know that you’ve found and fixed the issues they’ve encountered.

Robert Blumen 00:06:45 So we’re going to be talking about customer issues escalated to engineering. What I had in mind, then, is that customer issues go somewhere else, and then they get escalated to engineering. What I’ve seen in most organizations is that alerts and monitoring go directly to engineering. So what are the issues that do not go directly to engineering? Where do they go?

Liran Haimovitch 00:07:07 So, as you mentioned, there are two major kinds of customer issues. The first is the kind that you find yourself, through alerts, monitoring, and so on. The second is the kind that comes from the customers themselves, by them letting us know that there is a problem. If the customer does contact us, then usually it goes to support or some other customer-facing, sales-like organization, whether it’s support, customer success, account management, or whatnot. On the other hand, we’re also seeing that when it comes to technical alerts, there can be many different people who might actually get that alert. It also depends on your organizational structure. You have IT and ops; sometimes you have a DevOps organization. Sometimes support is partially responsible for operating production, and there are many other stakeholders within the organization that might be involved in the process of identifying the issue, prioritizing it, and determining what to do with it.

Robert Blumen 00:08:07 Let’s talk about the customer facing organizations that receive issue reports directly from the customer. What is the happy path for how these issues would be addressed?

Liran Haimovitch 00:08:21 So the happy path, which not many organizations are successful at, is that whatever customer service representative is speaking with you is going to be immediately able to identify your problem, fix it, and send you on your way. Unfortunately, that’s rarely the case. Quite often, those customer service representatives need additional support, additional knowledge, additional tools, and then they gradually escalate some of those issues to more qualified personnel, personnel who have a deeper understanding of the system. And quite often it goes all the way to engineering. The thing is, those issues that get escalated to engineering tend to have two things in common. One is that they tend to be very difficult to solve, because otherwise somebody else should have been able to solve them earlier. The second is that they’re important enough from a business perspective, because if it’s a small problem, even if we didn’t manage to solve it, chances are it’s going to get deprioritized. Only if it’s something that’s going to have a real business impact will it get escalated and prioritized by the product team, management, and so on, and make its way all the way to the other side of the organization for engineering to take a look at it and potentially fix it.

Robert Blumen 00:09:41 Is it often difficult to reproduce issues that customers are reporting?

Liran Haimovitch 00:09:47 I would say, from everything we’ve heard, reproducing customer issues is the number one problem engineers are facing when they’re trying to tackle and resolve those problems. Even worse, it’s not just about being able to reproduce the issue. It’s about being able to reproduce the issue reliably, in an environment you can control, observe, and understand. And unfortunately, quite often the production environment doesn’t have those characteristics.

Robert Blumen 00:10:21 You’ve talked about how the issues that cannot be addressed immediately by the customer-facing organization are escalated. Who makes the call? Does the customer say, I don’t think I’m getting anywhere with support, I want engineering? Or does customer-facing say, this is out of our scope, we need to get engineering involved?

Liran Haimovitch 00:10:44 So what we usually see is that customer-facing representatives say, we can’t solve this issue ourselves, and they try to escalate it to somebody who can help them. And unfortunately, that’s where prioritization kicks in, because it’s not just about what can be resolved. Usually there are dozens, sometimes hundreds, of issues, feature requests, and ideas that organizations have to address on a daily basis, and yet they can’t resolve all of them. So once a set of issues that haven’t been resolved properly by the first tier of support, by the first tier of interaction with the customer, is identified, they are prioritized by their impact on the customer, by the importance of the customer, and by the business need of resolving them.

Robert Blumen 00:11:33 I have heard some terminology in some organizations where they have multiple tiers of support. Are those all tiered within the customer facing organization, or is that more like a priority or severity level that’s assigned to the incident and engineering may be in some of the tiers.

Liran Haimovitch 00:11:55 So what you usually see with customer issues is that there are two axes. One is tiers; the other is priority. The priority axis is how important the issue is. It usually starts out with P0. P0 means something horrible has happened and the customer is unable to use the system, or even worse, the customer might be limited in their own ability to operate due to a failure on our end. A P1 is a significant outage, and P2 or P3 are generally minor outages or just feature requests. So that’s the priority side of things. And keep in mind, those priorities are per customer. So something might be P0 for the least important customer of the company, and something else might be P2, but for the most important customer of the company. So they might be treated differently. And it’s not just about the customer’s priority; it’s also about the business.
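The interplay between per-customer severity and customer importance described above can be sketched in a few lines of Python. This is only an illustration: the `customer_weight` field, the 1-to-5 scale, and the scoring formula are assumptions made for the example, not something described in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    title: str
    severity: int         # 0 = P0 (worst) ... 3 = P3 (minor or feature request)
    customer_weight: int  # hypothetical business importance, 1 (low) to 5 (high)


def triage_score(issue: Issue) -> int:
    """Higher score means handle sooner. The formula is an illustrative assumption."""
    return (4 - issue.severity) * issue.customer_weight


def triage(issues):
    """Order issues by combined severity and customer importance."""
    return sorted(issues, key=triage_score, reverse=True)


issues = [
    Issue("Login down for a small account", severity=0, customer_weight=1),
    Issue("Slow report for a key account", severity=2, customer_weight=5),
    Issue("Feature request from a mid-size account", severity=3, customer_weight=3),
]

for issue in triage(issues):
    print(triage_score(issue), issue.title)
```

With these made-up weights, the P2 from the key account ends up ahead of the P0 from the small account, which is the kind of outcome the per-customer view of priority can produce.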

Robert Blumen 00:12:55 So I may have conflated a couple of different things there: the tiering of issues with the severity and the priority.

Liran Haimovitch 00:13:04 Yeah. And tiers are essentially the expertise level, the experience of whoever is dealing with the system. Smaller organizations are going to have one or two tiers, while large organizations may have as many as four or five, with tier one and tier two usually residing within the customer-facing organization, customer support, customer success, whatever, while tier three and tier four might live within operations or engineering. Tier four, sometimes tier five, really falls to engineering itself.

Robert Blumen 00:13:37 The higher tiers are engineers who have deep experience in the product and who are potentially able to solve any issue with the product, or at least identify the cause. All engineering teams have some process for getting work into their system: they have scrums or planning or project planning. What is the process by which customer-facing identifies the work that they want engineering to do? And how does that get into the engineering workload?

Liran Haimovitch 00:14:09 So in some ways it’s the same. Essentially, the organization has some processes in place to align all the stakeholders and get everyone on board, whether it’s based on pre-allocated resources that are dedicated to support, or whether it’s by temporarily reducing the scope of the roadmap, so they trade planned work for unplanned work, and it’s a bit of a balance. Now, many of us know how tricky and complex it is to align on the long-term product roadmap, because everybody wants a piece. Everybody has a feature that’s the most important for them, and everybody is speaking from their own organizational position about what’s important and what’s not. In many ways, prioritizing support issues is even worse. It’s worse because people actually have a real personal stake in it, not only because different people are responsible for different accounts, and the accounts they’re responsible for are the ones they care deeply about and know very well.

Liran Haimovitch 00:15:14 They also care personally about the people they are working with. In most organizations, customer-facing representatives are compensated based on their ability to deliver value to the customers. So essentially, once the customer comes with a bug that may change the business impact for that customer, the team working on that from the business side of things might actually have their compensation affected by whether or not this bug is resolved, and by how much. Now, obviously, as you can imagine, people are much more emotional and much more personally invested when their compensation is on the line, depending on whether or not a bug is solved. In addition to the personal compensation, the other thing that’s different from traditional roadmap prioritization is the time span. When working on the roadmap, you have a fairly long schedule, whether you’re working in weeks or months, or sometimes even years. It’s something that’s a bit longer term, and it’s easier to take time to make your decisions, take time to deliver, and so on and so forth. When it comes to support, usually everything is for tomorrow, if not yesterday. Everything is overdue, and by definition the customer is already unhappy because something has gone wrong. And so this urgency drives the process much faster, at a much higher volume of discussion, and in general tensions are much, much higher when it comes to support.

Robert Blumen 00:16:44 Every organization has competing uses for its scarce resources. Someone at some level of management has to decide: we’re either going to build new products, or support the ones we have, or make the ones we have better, or help the customers. How does an organization decide how to prioritize customer issues in the mix?

Liran Haimovitch 00:17:10 So I guess the answer is, it depends. It depends on where the power centers within the organization lie. Is the organization engineering-driven? Is it product-driven, marketing-driven? Is it sales-driven? Is it somewhere else? Whoever is driving, whoever has the most capital in the organization, is going to have the biggest impact on that. Usually you see that in sales-driven organizations, for instance, customer issues are the most important things because of the direct, short-term impact on the bottom line, while organizations focusing more on engineering culture might not care as much about individual customers and individual customer issues. So it depends a lot. And obviously, decision-making processes also change from organization to organization. Is it data-driven? Is it driven by personality, by whoever is pushing it forward? Is it driven by other elements? It can differ very greatly.

Robert Blumen 00:18:09 You mentioned a few minutes ago the idea of unplanned work. Do most customer issues fall into the category of unplanned work?

Liran Haimovitch 00:18:18 For sure. You know, we generally divide engineering tasks into planned and unplanned work. Planned work is everything you mentioned on the roadmap side: new features, upgrades to the system, and so on and so forth. Unplanned work is everything we can’t exactly predict ahead of time, but know is going to happen. I would say customer issues are definitely the biggest and most painful type of unplanned work for most of the organizations we’ve seen. It has the most business impact in the short term, and it’s the one most guaranteed to happen. So it’s a huge driver, and unfortunately it’s even harder to predict: not only what the issue is going to be, but which teams are going to be involved.

Robert Blumen 00:19:03 What is the impact of unplanned work on the planned work, things like your roadmap?

Liran Haimovitch 00:19:09 So obviously that’s going to be a huge impact. Having to reallocate between planned and unplanned work has a huge impact on the ability to plan what you’re going to do and how you’re rolling. And unfortunately, it can be tricky. The more you’re able to predict unplanned work, the more you’re able to categorize it ahead of time and allocate specific resources to it, the better it is. And most of all, I think the most important thing is how well you can handle that unplanned work when it comes in. Is it throwing all your plans away while you’re trying to learn the system, learn the tools, re-educate the team? Or do you have a procedure in place? Do you have the tools in place to tackle those issues as they come, because you’ve already seen dozens of them and you kind of know what’s going to be coming up next down the road?

Robert Blumen 00:20:06 What would that look like? What kind of procedure would you put in place?

Liran Haimovitch 00:20:09 So each organization has different types of unplanned work in general, and customer issues in particular. Let’s say a customer is reporting a bug in your backend application. You’re going to need a certain set of tools to be able to investigate that. You’re going to want some level of observability into the performance of the application, such as an APM, especially if you’re suffering from performance issues on a regular basis. You’re going to want some level of logging. You need to be able to understand, at least briefly, what the customer was doing when something went wrong. And you’d probably want to give your team a set of tools that would allow them to see the information about the customer, access the customer profile, and so on and so forth, which can be a bit tricky in production environments due to security and privacy concerns. And last but not least, you definitely want a tool that will give your engineers the ability to extract additional data as they learn they need it, without having to spend a lot of engineering cycles and effort on data collection just to resolve issues.

Robert Blumen 00:21:22 Talking about extracting additional data, are you talking about telemetry where you could, say, increase the log levels of logs that you’ve already instrumented in your code, or in other ways crank up the level of telemetry to collect more data about what’s happening, or more finely targeted data?

Liran Haimovitch 00:21:44 Exactly. We need to think of observability in general as an agile platform. We can’t just stick ourselves with a very strict, very rigid set of pieces of data and rely on that same data day in, day out to save us, because we’re going to be meeting many different issues, and each of them is going to require different insights. Now, stuff such as increasing log verbosity for specific modules or users, or being able to insert new log lines on the fly, those are critical elements that are going to allow your engineering teams to collect data faster and to adapt the data they’re collecting to the issue at hand, rather than trying to collect everything all the time, which is not going to work. And it’s probably going to cost you a fortune in observability tools as well.
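One way to picture raising log verbosity for specific users at runtime is a filter in Python’s standard `logging` module. The sketch below is a minimal illustration under assumptions: `TargetedDebugFilter`, `ListHandler`, `handle_request`, and the `user_id` field are hypothetical names invented for the example, and this is not a description of how Rookout or any particular observability product works.

```python
import logging


class TargetedDebugFilter(logging.Filter):
    """Pass INFO and above for everyone; pass DEBUG only for users under investigation."""

    def __init__(self):
        super().__init__()
        self.debug_users = set()  # can be changed while the process is running

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True
        return getattr(record, "user_id", None) in self.debug_users


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("checkout_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
targeted = TargetedDebugFilter()
handler = ListHandler()
handler.addFilter(targeted)
logger.addHandler(handler)


def handle_request(user_id):
    # The user_id is attached to each record through the standard `extra` mechanism.
    logger.debug("cart contents loaded for %s", user_id, extra={"user_id": user_id})
    logger.info("request handled for %s", user_id, extra={"user_id": user_id})


handle_request("alice")          # only the INFO line is kept
targeted.debug_users.add("bob")  # turn on extra detail for one user at runtime
handle_request("bob")            # both the DEBUG and INFO lines are kept
print(handler.messages)
```

The design choice here is to keep the logger itself at DEBUG and do the narrowing in a filter, so that detail can be switched on for one user without a redeploy and without paying for debug output from everyone else.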

Robert Blumen 00:22:40 I think what you’re talking about is that if you’ve built one or two products, you’re aware that there are going to be issues, and you get out in front of that by saying, let’s engineer into the product from day one the ability to do the tasks that we know we’re going to need to do to resolve the issue. So it’s already there, rather than waiting until you have an issue and then asking, how do we figure out what’s going on?

Liran Haimovitch 00:23:09 Yeah. So we definitely need to design our systems to be able to collect data from them. And we definitely need to take our insights from the past, from previous projects, from the team’s experience, to build a system that’s easy to troubleshoot. But we also need to find a way to build agility into the system, and that’s what we’re doing by building agility into the observability of the system. You can adapt the data you’re collecting, whether you need new metrics, new logs, or whether you just need to know a variable’s value somewhere. Being able to use agile telemetry makes a huge impact on your ability to get the data you need when you need it, rather than trying to work around the data you happen to have and trying to figure out how to draw the conclusions you need from random data that was decided upon in the past.

Robert Blumen 00:24:06 Could you tell me a story of an issue that you were involved in that got escalated and how the engineering team approached it?

Liran Haimovitch 00:24:14 So, dozens of stories. I think one of my favorites was actually one of the first customers where we deployed Rookout. They had an internal portal for employees to report IT outages, and a handful of employees were unable to log into that portal. Everybody else was successful in logging in, but that specific handful of people couldn’t log in. And they had actually been chasing that bug for over six months, trying to figure out what was wrong. Essentially, whenever those people came into the application, they were asked to log in via their Google account. They logged in, and then the screen kept refreshing and got stuck, and then they got an error message. Now, using Rookout, the team followed down that rabbit hole, trying to see what was going on. And it turned out that when they were using the Google API for logging in, they had actually passed a flag saying, provide us with the full profile within the JWT, the JSON Web Token that was used for authentication.

Liran Haimovitch 00:25:22 Then, when they got the JWT back, they did a set of sanity tests on it. One of those sanity tests was actually testing for the size of the string. If the string was too long, I think over 8,000 characters, then they would trim it down to 8,000 characters. And unfortunately, as it turned out, for that handful of people, their profiles were larger, so the JWT, which the profile got packed into, was larger. And then this token got trimmed, essentially pushing the system into a re-login loop that ended up as an infinite redirect loop terminated by the browser. Now, the irony is that within that if, for a string longer than 8,000 characters, they actually had a comment: this should never happen, add a log line here. And that log line was never added. I think that’s a perfect example that we can’t always rely on the observability decisions that were made when the code was written.

Liran Haimovitch 00:26:25 Obviously, there is going to be a plethora of knowledge, a huge amount of knowledge in that, whether it’s info-level logging, debug-level logging, metrics, and so on and so forth. And we want to be able to access all of that. But we also need the ability to adapt observability on the fly, to collect new data, new facts, new context. Most of the code we’re debugging wasn’t written last week. Maybe it was written a year ago or a decade ago, and we need the ability to easily re-instrument it and collect new data as we’re dealing with new challenges today.
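The failure mode in this story, a signed token that stops verifying once it is trimmed, can be reproduced with the Python standard library alone. The sketch below is a stand-in under stated assumptions: it signs with a shared-secret HMAC (HS256-style) purely to stay self-contained, which is not how the customer’s Google login worked, and `SECRET`, `sign_token`, `verify_token`, and `sanity_check` are hypothetical names. The 8,000-character limit is the approximate figure from the story. The warning inside `sanity_check` plays the role of the log line that, per the story, was never added.

```python
import base64
import hashlib
import hmac
import json
import logging

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only
MAX_TOKEN_CHARS = 8000   # approximate limit mentioned in the story


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_token(claims: dict) -> str:
    """Build a header.payload.signature token signed with HMAC-SHA256."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


def verify_token(token: str) -> bool:
    """Return True only if the token has three parts and the signature matches."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    signing_input = f"{parts[0]}.{parts[1]}".encode("ascii")
    expected = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return hmac.compare_digest(expected, parts[2])


def sanity_check(token: str) -> str:
    """Mimic the trimming check from the story, with the log line added."""
    if len(token) > MAX_TOKEN_CHARS:
        logging.warning("token is %d chars, over the %d limit",
                        len(token), MAX_TOKEN_CHARS)
        return token[:MAX_TOKEN_CHARS]  # the silent trim that broke the login
    return token


small = sign_token({"sub": "user-1", "profile": "x" * 100})
large = sign_token({"sub": "user-2", "profile": "x" * 9000})

print(verify_token(sanity_check(small)))  # small token passes through untouched
print(verify_token(sanity_check(large)))  # trimmed token no longer verifies
```

Because the signature sits at the end of the token, trimming an oversized token cuts it off entirely, so verification fails every time. A system that responds to a failed verification by sending the user back to log in would then loop, which matches the redirect loop described above.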

SE Radio 00:27:04 This episode is sponsored by Spot by NetApp, the cloud automation platform that makes it easy to deliver continuously optimized infrastructure at the lowest possible cost. Get the most out of your cloud investments by automating cloud infrastructure to ensure performance, reduce complexity, and optimize costs. Their machine learning and automation scale to exactly meet application needs, using the most efficient mix of instances and pricing models, eliminating the risks of over-provisioning and cloud waste. Integrate Spot with all leading cloud platforms, services, and tools. Check them out at spot.io/seradio, where you can find more information, request a demo, or even start a free trial.

Robert Blumen 00:27:43 You mentioned people chasing that issue around for six months. Is that an unusual amount of time to resolve an issue?

Liran Haimovitch 00:27:52 So I would say it’s not your average bug. Most bugs do get resolved within a few days, maybe two or three weeks, once they actually get worked on. But almost every team I’ve met with has its horror stories, and I have my own personal ones. Everybody I know has a set of bugs they’ve chased for six months or more. Some people I’ve met have bugs they’ve chased for over a year and still haven’t solved. And just the fact that people kept working on those bugs for six months means that they were important. Somebody cared about them, and still they couldn’t get them resolved within a reasonable timeframe.

Robert Blumen 00:28:33 When an issue is open that long, is it the same person who’s trying different things, or is the issue passed around among different people or maybe even different teams?

Liran Haimovitch 00:28:46 So usually you would see those issues get passed around. If that issue is important enough to be worked on, you would definitely want additional people to get started with it. At the very least, you would want a couple more people from the same team to take a look. But as organizations are often very big, you would find that a single bug can involve three or five or seven teams. And then you quite often have 20 or 30 people all involved in trying to chase down that bug.

Robert Blumen 00:29:18 In some emails you and I exchanged before the podcast, you mentioned the drain that issues create on development resources. I can totally see that from the story you’re talking about. In particular, you mentioned issues with technical debt and employee motivation. Could you go more into that?

Liran Haimovitch 00:29:42 Of course. So technical debt is a loaded term that can mean many things. The bottom line, I would say, is that whoever is saying something is technical debt is saying they don’t like how it is written, whether it’s a component or an application or whatnot, how it is built and how it’s operating right now. And there can be many, many reasons why somebody doesn’t like how something is built. I would say the obvious two are that something was built very poorly, or that there is a difference of opinion: whoever wrote it originally and whoever is working on it now fail to agree on how it should be done. But for the most part, I would say those are the rarer ones, the less common stuff. There are two more common reasons we’re seeing for technical debt.

Liran Haimovitch 00:30:39 One is that something has been adapted to work differently. Something started out with a certain function and gradually, over time, moved to something else, which means that obviously it’s not doing whatever it’s doing right now very well, because it was never meant to do that. And two, which is even more common, is that people are saying something is technical debt because they fail to understand it. They don’t have a good enough understanding of what’s going on, why it’s going on, and why it was designed that way. And so, due to lack of understanding, they don’t like what they’re seeing. They don’t understand it; they think it could be done better. Quite often, once people truly understand how something was written and why it was written, they don’t hate it as much. They don’t consider it technical debt as much. And obviously, dealing with customer issues is a great example of that, because somebody is complaining that the code is not operating properly. How well do you understand the code? How well do you understand how it was designed to do whatever it’s doing? Those are critical factors in your ability to understand and, more importantly, fix the issue.

Robert Blumen 00:31:54 So when we’re trying to tie together technical debt and customer issues, are you saying that the customer issues reveal that the company, for example, did not invest enough in design or testing or monitoring when the product was shipped, and that’s why we’re seeing the issue now?

Liran Haimovitch 00:32:12 I wouldn’t go as far as saying that. Sometimes customer issues are unavoidable. The only way to avoid customer issues is to not have customers, which is not a good strategy for most of us. At the same time, we have to prepare ahead of time. We have to think it through. We do have to do our best to design the product properly, to test it properly, to bring it to the highest quality we can provide while still delivering it fast enough and delivering value to the customer and to the organization. However, it’s also important to note that customer issues, especially repeated customer issues and significant customer issues, are usually a sign that something is wrong, usually a sign of technical debt. That technical debt might be because the code is bad or has been misused, but it can also be that we’ve shifted away from the code, or that the team doesn’t have as good a grasp of the code as it used to.

Liran Haimovitch 00:33:09 And so they’re making modifications that are inadvertently adding bugs to the code. And so throughout the process, it’s all about empowering your engineers to learn the code, to understand it in the best possible way, and also to avoid too much emotional resonance. Focus on the code, focus on the technical part, focus on the professionalism. And if there’s a problem with the code, if it’s being modified, if it needs to be rewritten, which is often the crusade of every engineer exclaiming technical debt, then obviously rewrite it, refactor it, rebuild it. Sometimes that’s needed. Oftentimes it’s not. But I think the most important thing when dealing with technical debt is, first and foremost, understand the code, understand the design, understand what’s going on. And once you’ve got that visibility, make sure you have the tooling in place to keep it high quality. Think about testing and so on and so forth.

Robert Blumen 00:34:11 The other area where you mentioned unplanned work from customer issues being very impactful is employee motivation. What did you mean by that?

Liran Haimovitch 00:34:21 So engineers like building new shiny stuff, or at least most engineers do, and nothing is more remote from building new shiny stuff than fixing customer issues. That’s essentially kind of treading in place, going day in, day out and making sure that something that’s supposed to be working is working. And there’s very little satisfaction or feeling of progress in that. Even worse, while you’re trying to fix something that should already be working, you’re encountering two additional issues. The first is that quite often, you have very little understanding of what’s going on. What’s the problem? When are you going to figure out what the problem is? And when are you going to fix it? So you’re essentially in the dark. And even more so, you’re under a lot of pressure because, as we mentioned, there is a business impact on the line. It’s very short term, and quite often many important people in your organization are going to care about the business impact. And so sometimes you’re going to have a lot of people breathing down your neck about when this is going to be resolved, how you can go faster, and so on and so forth, while you don’t even know what’s going on. And so the ability to collect data, to understand what’s going on, is paramount in your ability to feel in control, deliver faster, and avoid demotivation.

Robert Blumen 00:35:55 Yeah. I could see if you feel like you’re doing the same thing over and over again, and work you thought you shipped keeps coming back to you, that could be very demotivating. On the other hand, I had a conversation with an engineer recently who’s in ops or support. He said he realized at an early point in his career that he wasn’t very good at building things, but if something broke, he was very good at fixing it quickly. Do you think there’s a type of person who gets joy and satisfaction out of chasing down problems and solving them?

Liran Haimovitch 00:36:26 There is definitely joy in chasing down problems and solving them. I think that joy is more fun, more glamorous, more enjoyable when you’re dealing with new stuff, when you’re fixing a problem that wasn’t there before. You’re building something new, you find a problem, you’re fixing it. Now something new is working. And it’s not as enjoyable when you’re dealing with something that should already have been working. So essentially, delivering something new is going from the bottom up, while fixing something that ought to already be working is kind of top to bottom and back to the top, which is not nearly as much fun.

Robert Blumen 00:37:07 So we’ve talked a lot about what kind of problems go to support, how they’re solved, and the need to be proactive in how you build the product so that you have the tools to do the telemetry, so that when you have a problem, you can be productive in getting data to chase it down. Are there overall some best practices for an organization? You also mentioned you’re going to have customer issues. We don’t live in a world where you can say, we’ll ship it, it’ll be perfect, we won’t have any issues, so we don’t need to plan for that. What are the best practices for an organization to effectively be able to handle these issues?

Liran Haimovitch 00:37:47 So there are a few key concepts here. The first is to acknowledge that you are going to have those customer issues. It’s paramount that you know that you’re going to have them, and you need to set expectations. You need to set expectations with the product team that you’re going to need to allocate some resources for unplanned work. You need to set expectations with the business team, and sometimes with the customer himself, that there are going to be issues, but you’re going to be there and you’re going to fix those issues as fast as you can. And you need to set expectations with the engineering team to make sure that they know that once they’ve delivered the new feature, they have to be there to support it. They have to be there to solve the customer issues around it. Number two, you need to define a clear process.

Liran Haimovitch 00:38:36 We’ve talked about it a bit early on. How are those issues going to be coming in? Are they going to be coming in informally? Are they going to be coming in from bug boards? Who is going to be picking them up? Who is going to decide what’s important and what’s not? And that process, you probably want some draft of it early on, and you want to gradually work on it and improve on it as you’re seeing things through. And you definitely want to empower your engineers. Number three, you want to empower your team. You definitely want to make sure your engineers have the tools in place to investigate those issues. So first, you want some observability tools. Second, you probably want some back office tools that are going to provide your engineers some insight into the profiles of the customers, the configuration of the system, and so on. And third, you want to empower your teams to agilely collect data. Whether you’re using CI/CD or whether you’re using Rookout, you need to make sure that your engineers are going to have the tools to push new data collection.

Liran Haimovitch 00:39:39 And you want those tools to be effective. You want them to be secure and efficient, and you don’t want your engineers stuck on missing data, because that’s the number one problem. Most of the time your engineers are going to be stuck trying to find the bug rather than directly fixing it. And the one thing they need to find bugs is information. Last but not least, you want to improve throughout the process. We’ve talked about various parts of the process. How do you get the bugs? How do you prioritize them? Who is involved? How do you fix them? How do you collect data, observability, and so on and so forth? You want to learn from the process. You want to learn about what kind of bugs are coming in. How can you fix them faster? Which bugs took you two months, six months? How can you make sure they’re not going to take you that long next time? And keep in mind that you can often use those customer issues as feedback for the product team and for the engineering team on how to build a better product and how to empower your support team to resolve more issues by themselves next time.

Robert Blumen 00:40:44 Can you give an example of something you learned from dealing with customer issues at a place where you worked, which could be Rookout or someplace else, that enabled you to improve how the organization handled customer issues?

Liran Haimovitch 00:40:59 Sure. So at Rookout, we provide SDKs for remote debugging and data collection, and we provide them for five languages: Java, .NET, Ruby, Python, and Node. And early on, when we started deploying things, and it was Java for the first time, we had a lot of a learning curve. Every time we went to a customer and we wanted to deploy the SDK, we ran into various issues. On the one hand, we used every one of those issues to improve the product, to make it easier to deploy, to make it more stable, and so on. At the same time, we’ve also built out a questionnaire that has all those key elements that we know we want to learn before we get started with the customer, whether it’s Java versions or operating system or what cloud they’re on, whatever it is.

Liran Haimovitch 00:41:50 And now we’ve made that questionnaire. We can share it with customers, we can share it with our support team and with the engineering team, and whenever we’re dealing with the customer, we make sure we have that data. We want the data before deployment. We want that data available whenever a bug is reported. And so we are much more focused when we’re working on those customer issues, because we know what we’re facing. And throughout the process, we’ve learned time after time not only how to resolve the direct issues we’re encountering, but also where the product needs to be redesigned to be more reliable, to be easier to use, and to be safer. And you can see the impact, a huge impact, on how quickly and safely the product can be installed today versus how it was a few years ago.
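
As an illustration of the questionnaire idea described above, here is a minimal sketch in Python. The class name and every field are hypothetical, chosen only from the examples mentioned in the conversation (runtime version, operating system, cloud); it is not Rookout’s actual questionnaire. It records the environment facts for one customer and reports which questions are still unanswered, so that support or engineering can see what data is missing before deployment or when a bug is reported.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DeploymentQuestionnaire:
    """Environment facts gathered before deploying an SDK (hypothetical fields)."""

    customer: str
    runtime: Optional[str] = None           # e.g. "Java"
    runtime_version: Optional[str] = None   # e.g. "11"
    operating_system: Optional[str] = None
    cloud_provider: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of the questions that have not been answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_complete(self) -> bool:
        """True once every question has an answer."""
        return not self.missing_fields()


# A partially filled questionnaire: OS and cloud are still unknown.
answers = DeploymentQuestionnaire(
    customer="example-co", runtime="Java", runtime_version="11"
)
print(answers.missing_fields())
```

Keeping the questionnaire as structured data rather than free text is one possible design choice; it makes it straightforward to attach the same record to every bug report for that customer.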

Robert Blumen 00:42:37 The difference was between having this data and not having it. And by having it, you were able to drive more rapid resolution of issues. Okay.

Liran Haimovitch 00:42:47 And sometimes even detect them before the customer, because we know what to look for and what to test and so on and so forth.

Robert Blumen 00:42:55 Organizations invest a lot of resources in building new products and ensuring the quality of the products. They have things such as different types of testing. We’ve also talked about how you can’t avoid customer issues. You need to plan to handle them and to handle them efficiently. Do you have any thoughts on how you would decide, when you have too many issues, whether the best path is to do a better job at handling them, or to invest more in quality upfront so you don’t have as many? What’s the balance of where you should invest?

Liran Haimovitch 00:43:34 That’s a difficult question. That’s a good question. I would say that, first and foremost, if you’re not seeing any issues, then something is wrong. Maybe you don’t have customers or, more likely, you don’t have a good enough feedback loop, because chances are there are issues and they’re just flying below the radar. And you probably want to improve your feedback loop with better monitoring and better customer communication, to learn more about the issues. On the other hand, you shouldn’t be overly worried about having too many issues, because if you dig in deep enough, if you communicate enough with the customers, there are going to be endless issues. There is always room to improve. And there are many matters of opinion and many things that might be going outside of the specification of the system or the performance of the system. And you shouldn’t be overly concerned with having issues or even the number of the issues. What you need to care about is: are those issues impacting the business?

Liran Haimovitch 00:44:34 Are customers satisfied with the product? Are customers satisfied with the quality? In many cases, you have very clear expectations. Some products have to be up 99% of the time, while others have to be up 99.999% of the time, the so-called five nines. And based on the category you’re in, based on your customers, you should have some clear understanding of what’s the business value of uptime, for instance, or of quality, and what threshold you’re aiming for. And then you need to measure yourself against the threshold. Am I meeting the five nines? Are customers enjoying my product? Are customers complaining about the quality of the product? Because customers might be reporting issues, and on the one hand you might have tons of bugs, but at the same time they might be very happy with the quality. You know, if you have enough customers and everybody’s complaining about minutiae and smaller stuff you don’t care about, they might still be happy.
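
To make the gap between the availability targets mentioned above concrete, here is a small Python sketch that converts an uptime percentage into the downtime it permits per year. The function name and the choice of a 365-day year are assumptions made for illustration; the conversation itself only names the 99% and five-nines targets.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Compare a 99% target with three nines and five nines.
for target in (99.0, 99.9, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99% uptime allows roughly 5,256 minutes (about 3.65 days) of downtime a year, while five nines allows only a little over 5 minutes, which is why the threshold a product aims for should follow from the business value of uptime.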

Robert Blumen 00:45:39 Do you have any final thoughts on the topic that we haven’t covered?

Liran Haimovitch 00:45:44 No, that’s pretty much it.

Robert Blumen 00:45:46 Okay. Where can people find you if they want to know more about it?

Liran Haimovitch 00:45:52 So feel free to reach out to [email protected] and you can also find me on Twitter. Liron underscore last.

Robert Blumen 00:45:59 We’ll put that in the show notes. Thank you very much for speaking to Software Engineering Radio.

Liran Haimovitch 00:46:05 Thanks for having me. Thanks, everybody, for listening in.

Robert Blumen 00:46:08 This has been Robert Blumen for Software Engineering Radio. Thank you for listening.

SE Radio 00:46:15 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our [email protected]. To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our slack [email protected]. You can also email [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
