Tyler Flint, CEO of qpoint.io, joins host Robert Blumen for a conversation about managing external vendor dependencies, including several best practices for adoption. They start with a look at internal versus external services, including details such as the footprint of external services within a microservices application and the difficulties organizations have tracking, quantifying, and auditing their consumption of external services. Tyler also discusses the security implications of external services, including authentication and authorization. They examine metrics and monitoring, with recommendations on the key metrics to collect, as well as acceptable error rates for external services. From there they consider what can go wrong, how to respond to external service outages, and challenges related to testing external services. The episode wraps up with a discussion of qpoint's migration from a proxy-based solution to one based on eBPF (extended Berkeley Packet Filter) kernel probes.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
- qpoint.io
- qpoint.io: Managing External API Dependency Risk
- Managing External Integrations in Production Environments
- eBPF project home
Related Episodes
- SE Radio 612: Eyal Solomon on API Consumption Management
- SE Radio 591: Yechezkel Rabinovich on Kubernetes Observability
- SE Radio 445: Thomas Graf on eBPF (extended Berkeley Packet Filter)
- SE Radio 619: James Strong on Kubernetes Networking
- SE Radio 643: Ganesh Datta on Production Readiness
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Robert Blumen 00:00:19 For Software Engineering Radio, this is Robert Blumen. Today I'm joined by Tyler Flint. Tyler is the CEO of qpoint, a firm that focuses on egress observability. Prior to qpoint, he was the co-founder of three other PaaS companies and was a Software Engineer at DigitalOcean. Tyler, welcome to Software Engineering Radio.
Tyler Flint 00:00:42 Thank you. I really appreciate you having me on, Robert, it’s great to be here.
Robert Blumen 00:00:46 Happy to have you. Is there anything else about your background you’d like to cover?
Tyler Flint 00:00:51 I don’t know that my background is all that important other than just, it feels like I’ve been in this space for so long that I’ve watched the cloud grow up, and I do have a funny story about containers in the Linux kernel before they were a thing. But if it presents itself, I’m happy to tell that story.
Robert Blumen 00:01:06 Well, we’re all about staying on topic here, so I’m going to pass on that and get right to the main topic of our conversation, which is managing external API dependencies. Before we talk about managing external services, can you situate the problem? What type of systems or architecture are we talking about that have external dependencies?
Tyler Flint 00:01:29 Yeah, that's a great question. So most applications today have at least one external dependency; most have dozens or hundreds or even thousands. Dependencies can take the form of either internal service dependencies, like in a microservice type of application, or really any vendor or third-party API dependency. And so almost every company that exists today has at least one dependency on a billing API or some sort of management API that they rely on for critical functionality.
Robert Blumen 00:02:05 Can you give some other examples beyond that one?
Tyler Flint 00:02:07 Yeah, so there are kind of two domains. One domain is this microservice architecture that we've seen proliferate in the last, you know, 15 years. And to a particular service in a microservice app, everything is a dependency: every external service is an external dependency. In a large organization, usually those services are run by isolated teams that pretty much act as if they're an external vendor. And then when we look at the actual vendor or third-party dependencies, there are a lot of dependencies spread across billing APIs, customer relationship management APIs, automation tooling, and text, phone, and other audio platforms. There are a lot of dependencies lately on external LLMs like OpenAI or Anthropic. And so what we have seen is that modern applications are really a sprawl of service dependencies.
Robert Blumen 00:03:14 Consider a large enterprise that's operating a microservice architecture. You said just now that if I work on the team that implements service A, we're responsible for that, and service B may appear to us to be external. But surely there are differences between that and a service that we buy from another organization entirely, where no one there works for the same boss at any level?
Tyler Flint 00:03:40 Yeah, absolutely. The levels of accountability are different, and the lines of communication are certainly different. Probably the biggest difference is that if you have an external vendor, a third-party dependency, then while yes, you have a contract and you're trying to hold them accountable to the terms that they have presented to you, it is incumbent upon the team to ensure that the application is resilient to the uptime and performance of that third-party vendor. Because at the end of the day, while you can go make some noise and you can try to influence their internal operation, you really have to accept the uptime and reliability of that vendor. Whereas with an internal service, you can get that other team in a meeting and say, hey, your SLA does not meet our SLO; we have to figure out how to compromise here or else we're going to have some difficulty. So there is a fundamental difference: with an internal team you can negotiate, with vendors not so much, and you just really have to be resilient.
Robert Blumen 00:04:41 Thank you for that. Another distinction I wanted to go into is, are external services necessarily paid or are there a lot of free services in the mix?
Tyler Flint 00:04:53 Yeah, there are a lot of free services. And then there are also free tiers: something might be free to your team and you're going to get one level of service, and then when you start paying, you get a different level of service. So there are a lot of free APIs, but more particularly a lot of free-tier usage.
Robert Blumen 00:05:13 I want to now start talking about what the footprint of these services is. You said the number of external services an organization has could be as few as one but could range up into the thousands. That was one of my questions. Are these services accessed from a data center, from a public cloud VPC, or where is the origin of the access?
Tyler Flint 00:05:36 Yeah, so in particular, there are two different segments within an organization. There's corporate IT, where you're really trying to limit the employees and what they have access to; that's really not our segment. That's an entire industry, a growing industry, SASE (Secure Access Service Edge), that has a lot of phenomenal products. Where we are focusing our effort is production services: production services that you are running within your data centers that are reaching out across boundaries, across public networks. The connections that are originating are primarily from the various apps that have been written, or from workflows. So it's really anything that is running on a server that starts to make a connection out. We can classify them in a lot of different ways, but primarily they're from applications that are running on your infrastructure, or from scripts or tasks that run on the infrastructure.
Tyler Flint 00:06:33 What we're seeing a lot of now is AI agents that are starting to talk externally, and then also, which is really concerning to organizations, a user that has maybe shell access and is running programs that reach out. So there are a lot of different sources of the connections, but primarily where we're focused is anything that's running within your protected environment, your production infrastructure, where you also have your most precious resources: databases containing company secrets and proprietary data. Anything that has access to those really needs to be considered, both from the security perspective and from the performance and reliability perspective, for the sake of your reputation.
Robert Blumen 00:07:17 I expect most organizations have some kind of gating to adopt a new service. Two things I can think of: one would be whitelisting the IP for egress out of the controlled networks, and another is that someone has to agree to write a check or authorize payment if you're adopting a paid service. Can you elaborate on what the adoption process is? What are the gates and steps in it?
Tyler Flint 00:07:45 Yeah, well unfortunately for us, what we have found is that it is very different across organizations. There are some organizations who adopt a policy of: we are not going to allow anything to talk out, and if you want to create a new contract or use a new service, the very first conversation has to start at the door of security. That's the first step in procurement. There are other organizations who are a little bit more open to bringing something in to incubate, to pilot it, and to leave security out of it: as long as there's some sort of handshake, we can go ahead and pilot this thing, and we're talking now to their external APIs, and then down the road we'll figure out how to incorporate that in. And then there are all sorts of variations in between. So without naming names, I can tell you there are three prominent companies, three common household names, and one of them essentially won't allow a new vendor into their organization unless the vendor is willing to spend multiple thousands of dollars just to start the security auditing process, which certainly keeps a lot of vendors out.
Tyler Flint 00:08:50 There's another company that has a process whereby they have to have a contract in place, and they check daily to make sure that that contract is still valid, and they will literally enforce or gate their connections based on the validity of that contract. And then another organization, and I just use this for contrast and of course I can't name names here, was acquired. It was a very public acquisition, and part of the acquisition is that you have to have a bill of materials of all of your external vendors. And when they went through that audit, they had hundreds of vendors in use where nobody knew where they started or how they came about; there was no paper trail. So it's kind of all over the place, and I think it just depends on the operational processes.
Robert Blumen 00:09:35 You raise an interesting point there. I was expecting to hear about companies having a lot more services than what they knew about because of how they were adopted. But a general thing I've seen in security is that we're really good at having lots of justifications for why I need to add Tyler to this group, I need to give Tyler all the credentials, I need to give Tyler roles and permissions, and much less good at noticing that Tyler's job responsibilities have changed or he's left the company and we need to make sure all this stuff is revoked. Do you see that asymmetry in the management of vendors as well?
Tyler Flint 00:10:11 Oh, everywhere. And one of the first ways that that is exposed is through API tokens. So as we started to talk to companies, one of the very first things that they brought up was: can you create an inventory of the API tokens that are being used? That way we can come in and find out if those are the tokens that are supposed to be used, or how long they have been used, how long they have been in rotation. And what we found, which was quite surprising to me, was that these are sophisticated teams with operational excellence using secrets management software. And even then, there are a lot of questions as to where all of those tokens are being used. When was that token created? Who was it created for? Is there some sort of expiration that's looming? If that token starts getting rejected, do we know why that token is getting rejected? And that really speaks to what you were just inquiring about, which is that oftentimes a service and an integration are set up, and then the care and feeding of that integration is: if it works, it works; don't fix it if it's not broken. And then that leads to some governance concerns later down the road.
Robert Blumen 00:11:19 I have a question which you've answered, but I'm going to put it out there anyway: do organizations tend to have a good understanding of their dependencies? Answer: no. What I'm going to ask you is to tell a story about something that happened to a company because of an unknown dependency, or a surprise during an audit.
Tyler Flint 00:11:42 Actually, I have plenty of these stories; it's so common that we were able to build it into our onboarding workflow. When you install the agent, the first thing we do is bring you into your inventory, and then we just wait for the surprise. We wait for you to realize: hey, what is that? Or why are we using that? Or where is that coming from? And so far, in every instance where we've run any sort of pilot or even an onboarding experience, they're really surprised. They're either surprised that they're using a vendor they didn't think they were using, or, I'll tell you the first one that comes to mind: there's a popular feature flagging application that a lot of companies use, and the team was certain that they had no critical dependencies on it.
Tyler Flint 00:12:32 They were certain that it wasn't calling into that API on every single request. And so they put this in, and it immediately popped to the top as their highest-consumed vendor. And when they looked at that, they realized that there was a direct correlation between their own website traffic and how much traffic they were sending out to that vendor. And it occurred to them that they had a problem with the way their application was implemented: it was asking the vendor on every single request, and there was no caching in between and there was no fallback. So that's just a recent one that comes to my mind. But the other, more common one is that as soon as they turn it on, they immediately realize how many monitoring tools and solutions they're using. And oftentimes the question is: wait, I thought we turned that off. And it's still running, you know, it's still running somewhere. So it's fun actually. It's been fun to kind of experience those.
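The failure mode described above, a vendor lookup on every inbound request with no caching and no fallback, has a compact fix. Below is a minimal sketch in Go; the Fetcher type and the flag semantics are hypothetical stand-ins, not the actual vendor's SDK.

```go
package flags

import (
	"sync"
	"time"
)

// Fetcher calls the vendor's flag API. In the incident described above,
// an equivalent call was being made on every inbound request.
type Fetcher func(key string) (bool, error)

type entry struct {
	val     bool
	fetched time.Time
}

// CachedFlags wraps a vendor call with a TTL cache plus static fallbacks,
// so one outbound call can serve many inbound requests.
type CachedFlags struct {
	mu       sync.Mutex
	cache    map[string]entry
	ttl      time.Duration
	fetch    Fetcher
	fallback map[string]bool // last-known-good defaults if the vendor is down
}

func NewCachedFlags(fetch Fetcher, ttl time.Duration, fallback map[string]bool) *CachedFlags {
	return &CachedFlags{cache: map[string]entry{}, ttl: ttl, fetch: fetch, fallback: fallback}
}

func (c *CachedFlags) Enabled(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.cache[key]; ok && time.Since(e.fetched) < c.ttl {
		return e.val // served from cache: no outbound call
	}
	val, err := c.fetch(key)
	if err != nil {
		if e, ok := c.cache[key]; ok {
			return e.val // stale-but-known beats failing the request
		}
		return c.fallback[key] // vendor down, nothing cached: use the default
	}
	c.cache[key] = entry{val: val, fetched: time.Now()}
	return val
}
```

With a TTL of even a few seconds, outbound traffic stops tracking inbound traffic one-to-one, and a vendor outage degrades to stale or default flag values instead of failed requests.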
Robert Blumen 00:13:27 You're doing a great job of answering questions before I ask them. I wanted to ask about risk factors: what risk do external service providers create? You've answered that a bit in your last answer, but could you elaborate on anything you haven't already covered?
Tyler Flint 00:13:45 There are three main areas that we approach. One of them is cost. There's a big risk around cost and attribution, and the most common thing there, we see it on social media, is that somebody suddenly gets a bill that is a little bit more than they were expecting. And then the question becomes: who's responsible for that? Which service, which application, which process, where is this coming from? So we bucket that into cost and attribution. And the one last thing I'll say on that category is that, especially for companies that make API calls on behalf of their customers, there is a big question of cost and attribution. If their bill comes back from a vendor and is directly proportional to the amount of usage from one of their customers, they need better tools to understand the risk of cost. So that's one.
Tyler Flint 00:14:39 The other is compliance and risk from a security perspective, so exposure. There's a handful of questions in that category that we hear all the time, especially from CISOs and VPs of security. What they want to know is: who are we talking to outside of this organization? Which applications or services are connecting to them? Where in the world are these connections terminating? And what data are we exfiltrating; do we know what types of data are being exfiltrated? And so we've really focused on trying to provide some of that understanding so they can answer those questions. We do that through an inventory and governance: we show them the vendors, we show all of the applications, we track that back to where it's coming from and where in the world it's going, and we have a map of where all your connections are going.
Tyler Flint 00:15:31 Then also, on the services that you would like, we can add some sensitive data scanning to extract the types of data. And the third category is really about reputation, and this is the performance and reliability aspect. One of the things that we're learning a lot about, and perhaps I had the wrong perspective when I got into this initially, is that I thought it was going to be so important for teams to be able to hold their vendors accountable. Certainly there's an aspect of that, but what we are hearing is that the burden of resilience is falling on these teams, and they're much more concerned about ensuring that their applications are resilient to the things they cannot control. As an example, a very well-known company that happens to operate software on cruise lines runs into challenges where the network is unstable many times throughout the journey, and they spend a lot of time trying to figure out whether their software is responsible. They spin up environments specifically to test network latency and packet loss. And so one of the things that they're working with us on is a way to use our technology to simulate all these conditions without having to spin up and provision all of this expensive infrastructure, and just be able to modulate those things directly in the kernel through eBPF. Sorry, that's probably a lot more than your original question, but the three main areas are cost and attribution, compliance and exposure, and reputation through performance and reliability.
Robert Blumen 00:17:05 Those are all good areas. I want to drill down a little bit into cost. One question I had is are there situations where yeah, we know about that service, we agreed to pay for it, we want it, but we’re using 10 times more of it than what we thought, and we didn’t know?
Tyler Flint 00:17:22 Yes. So we have seen that scenario in three variations. The first is exactly what you're saying: wow, we're using this a lot more than we thought, and we didn't realize we were using it so much. Now that we see how much we're using it, we can dive in to see if there are ways to cut that. In that scenario, one of the first questions they have is: could we implement some sort of Squid proxy somewhere and do some caching so that we can minimize the number of API calls we're making to that vendor? So that's one. The other is the scenario where they're not monitoring their usage, and then suddenly the vendor says, "No more, you're getting rate limited." What they will experience immediately is a massive service disruption, and then it suddenly turns into this wild goose chase: why are all these services offline?
Tyler Flint 00:18:14 And they have to go look in their mountain of logs to figure out what's happening, and then they're checking "is it down for everyone or just me?" and the vendor says they're online. And then when they look into it, they realize: oh, we've been rate limited. Wait, why are we rate limited? Who knows? Why are we using this more than our limits? Does anybody know what we've been doing recently? So that's the second case. And then the third, one of the most elusive, and I alluded to this briefly, is when you are making API calls on behalf of your customers. Then it gets really complex: with our usage of this vendor, are we getting rate limited because one of our customers is using 90% of our quota, or are we evenly distributed? Do we need to scale up, or do we just need to throttle this one customer? Those are the types of questions that are really challenging for organizations to answer, and just really expensive when those scenarios arise.
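Where attribution is the question, a thin wrapper around the HTTP client can count outbound calls per vendor and per customer. A sketch in Go follows; the customer-ID header is made up, and a real system would export these counters to a metrics backend rather than hold them in memory.

```go
package egress

import (
	"net/http"
	"sync"
)

// AttributingTransport counts outbound requests per (vendor host, customer),
// so that when a vendor rate-limits you, you can see whether one customer
// is consuming most of the quota.
type AttributingTransport struct {
	Base http.RoundTripper

	mu     sync.Mutex
	counts map[[2]string]int64 // key: {vendor host, customer ID}
}

func (t *AttributingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	cust := req.Header.Get("X-Internal-Customer-ID") // hypothetical header set by the app
	t.mu.Lock()
	if t.counts == nil {
		t.counts = map[[2]string]int64{}
	}
	t.counts[[2]string{req.URL.Host, cust}]++
	t.mu.Unlock()

	base := t.Base
	if base == nil {
		base = http.DefaultTransport
	}
	return base.RoundTrip(req)
}

// Snapshot copies the per-vendor, per-customer counts for reporting.
func (t *AttributingTransport) Snapshot() map[[2]string]int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make(map[[2]string]int64, len(t.counts))
	for k, v := range t.counts {
		out[k] = v
	}
	return out
}
```

A snapshot of these counters answers "is one customer using 90% of our quota, or are we evenly distributed?" without waiting for the vendor's bill or a rate-limit response.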
Robert Blumen 00:19:13 You talked about caching and monitoring, which I want to come back to. There's an area I want to explore a bit more first: if you have an essential service and you can no longer use it, are you out of business? And what does incident response look like when that happens?
Tyler Flint 00:19:32 Well, we were just having a conversation around this yesterday with a company, and they made it very clear, and this is usually what we find: there are a handful of dependencies that they would say are absolutely mission critical, and then there are other dependencies that are ancillary, auxiliary, and they want to approach those relationships very differently. They want to put the most effort into the dependencies where, if one goes offline, they're in big trouble. They literally told us yesterday that they have one dependency where, if they have even a single failed request, they have to ensure that the retry of that request has been triply persisted in their batch or retry queue, or else it sends an alarm to the highest levels. It was surprising to me to hear that they spend so much time ensuring that this one particular vendor always, always works and that they have a backup plan. Whereas the other ones are kind of more like: yeah, if they don't work, it's good to know, and maybe we can shift left a little bit, know quicker, and save ourselves some time. But on this critical handful, if something is trending in a bad direction, they want to know about it.
Robert Blumen 00:20:50 I can think of one example of a service like that: if you're selling something and you have a payment processor, then when you can't take payment, your business has stopped. Are there other common examples of that one critical service?
Tyler Flint 00:21:06 The one they were referring to yesterday was a customer-of-record type of service. For this particular company, customer relationships are core to their business, so they have to ensure that anything that crosses that line is registered. We've heard this as well in FinTech, where there are quite a few phenomenal FinTech companies that are creating, well, not virtual banks, but a banking experience that's backed by traditional banks. And when these experiences are used, virtual cards, etc., they need to be very, very certain that all of the API requests that go back to the bank have been registered. And if they failed, that also needs to be registered.
Robert Blumen 00:21:52 The example you gave a minute ago, retrying failed requests, that’s one strategy for ensuring that critical services are resilient. What are some other strategies for resilience of critical services?
Tyler Flint 00:22:05 Well, one strategy that I thought was fascinating, kind of going off of the FinTech example, and this was early on when we were just trying to formulate a hypothesis around this: there's a financial company that has terminals in various salons and other locations that take credit card payments and then, through a series of operations, relay them back to the bank API. What they ultimately found was that if that API request couldn't go through, it was a lot safer for them to just bubble it all the way back up: this transaction was not successful, try again. They just weren't able to put resilience systems in place that could give them the guarantees. So for them, you can imagine how important it is to understand when something is failing; that means they're not taking money, and they're not going to retry either until that's resolved. For them it was about knowing the very moment something fails. A lot of times companies are looking more for an error rate, or whether the error rate hits a certain limit, and in this case the company's posture was: if a single request fails, someone's getting paged, and we need to make sure that was an isolated instance as opposed to a trend that's about to make a very bad day for our financial team.
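The two postures contrasted here, absorb-and-retry for ordinary vendors versus fail-fast-and-page for money-moving ones, might look like this. This is an illustrative Go sketch under those assumptions, not the company's actual implementation.

```go
package resilience

import (
	"errors"
	"time"
)

// Call is a single attempt against a vendor API.
type Call func() error

// WithRetries suits ancillary vendors: absorb transient failures with
// bounded exponential backoff before surfacing an error.
func WithRetries(attempts int, base time.Duration, call Call) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil {
			return nil
		}
		time.Sleep(base << uint(i)) // back off: 1x, 2x, 4x, ...
	}
	return err
}

// FailFast suits the payment-terminal posture: no retry, surface the
// failure immediately, and page on the very first error. The pager
// function is a placeholder for whatever alerting system is in use.
func FailFast(call Call, page func(error)) error {
	if err := call(); err != nil {
		page(err) // a single failure is page-worthy for a critical vendor
		return errors.Join(errors.New("vendor call failed, not retrying"), err)
	}
	return nil
}
```

The design point is that retry policy is a per-vendor decision: the same code path should not silently retry a payment the way it would retry a telemetry write.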
Robert Blumen 00:23:25 In many verticals there are multiple competitors. What do you think about having a backup vendor or having two vendors and if one fails, you still got one?
Tyler Flint 00:23:37 We’ve heard a lot about that. I think one of the initial ideas, we didn’t end up going this way, but one of the ideas that we heard a lot from our network was creating a way to have pluggable vendors for a specific endpoint and kind of creating a uniform API, similar to kind of what happened in the telecom space where the leader came out with the API for text messages and voice messages and then all these other competitors just kind of adopted that same API so they could reuse the same client. And that was something that we’ve heard. We haven’t gone that route, but you know, it may come back up in the future.
Robert Blumen 00:24:11 I'm going to switch tracks a bit and talk more about security, starting with: how are external services authenticated?
Tyler Flint 00:24:20 So the number one universal approach is going to be through some sort of API token. And then there are other layers that can be added. One of the other common layers, to ensure that only trusted clients are connecting, is whitelisted IPs. Unfortunately that is proving to be more and more complex for organizations, and for vendors especially, where a lot of clients are now moving onto cloud, they've got containerized workloads, and IPs are changing. So in order to accomplish that level of security, what they have to do is push everything through a proxy or a subnet, and then they can whitelist a range of IPs. So primarily that's the approach. Some of the larger companies are using what they call either an egress gateway or an egress access point. What they do in that case is push the accountability back onto the application workloads to connect through this dedicated location, and then they'll use something like mTLS, so that it has to verify this is who you are before the traffic is allowed to go out.
Tyler Flint 00:25:30 So those are currently the two main approaches for authentication, or the two layers, I should say. One of the things that we're particularly excited about, and we've been working with design partners to push this quite a bit: if you think about what's happened on the inbound side of the industry, for a long time there were firewalls for inbound, and there still are firewalls, and then there was an explosion of web application firewalls operating at all sorts of different layers, even up at the edge. Now we see some prominent players in web application firewalls. What they're doing is essentially letting the connections go through and observing what they're doing, and the moment they can see something they can fingerprint, let's say a DoS attack or some sort of application-specific attack that they can detect right away, they just close the connection.
Tyler Flint 00:26:26 What we've been working on with our technology would be the inverse of that. We're calling it a client application firewall. It runs in the Linux kernel and does essentially the same thing: it starts to fingerprint a lot of these things, or it starts to look at the connections and what they're doing, and it allows companies to create very granular, sophisticated policies that have context from, say, the process, the containers, the deployments, the environment variables, as well as the connection and the network layer. With this approach, we're able to bring a new layer of security to these connections, to allow a company to do something like say: hey, let's make sure that only the billing team has access to our banking APIs. And they can do that by creating a policy that says: let's make sure that it's only workloads that are part of the following deployment or namespace, and here are the vendors; and if a connection is attempted that doesn't match all of those, then we can kill the connection directly in the Linux kernel via eBPF.
Tyler Flint 00:27:35 And there are all sorts of fascinating use cases that we're starting to uncover that fall under that. Just one other, real quick: one of the largest companies in the world has a policy, well, I don't know if it's new, but to me it sounded new, where they say that if we're going to reach out to an external vendor, whatever that API token is, that API token cannot have been provided to the application via an environment variable, because the environment variables are visible to anyone who can see the system or the proc file system. So what we were able to put together was a scenario where we can look at the connection and what's going across the wire, we can look at the HTTP header and see the token, and if the value of that token matches an environment variable on that process, we can kill that connection. Those are the types of things that we're really excited to be able to dig into through our technology.
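qpoint enforces that check in-kernel via eBPF. As a rough userspace approximation of the same idea, the Go sketch below reads a process's environment from the proc file system and compares it against a token observed on the wire; the function and its calling context are hypothetical.

```go
package guard

import (
	"bytes"
	"fmt"
	"os"
	"strings"
)

// TokenFromEnviron reports whether an API token observed on the wire (for
// example, pulled from an Authorization header) matches the value of any
// environment variable of the given process, which would indicate the
// token was provided via the environment in violation of the policy.
func TokenFromEnviron(pid int, token string) (bool, error) {
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/environ", pid))
	if err != nil {
		return false, err
	}
	// environ is a NUL-separated list of KEY=VALUE entries.
	for _, kv := range bytes.Split(raw, []byte{0}) {
		if _, val, ok := strings.Cut(string(kv), "="); ok && val != "" && val == token {
			return true, nil // policy violation: kill or flag the connection
		}
	}
	return false, nil
}
```

The in-kernel version has the advantage Tyler describes: the comparison happens at the point where the connection can be killed directly.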
Robert Blumen 00:28:32 If I understood the description of the network traffic fingerprinting, that would fall broadly under the realm of authorization because it limits who may access a particular service. Did I understand that correctly?
Tyler Flint 00:28:48 Yeah. So a lot of organizations right now are looking to the service mesh to solve these problems, and sometimes that's great, but other times it's not the right fit. Where it's not the right fit, one of the challenges is that a service mesh creates a lot of operational burden for the team, as well as sidecar dependencies all over. The other problem is that, especially with a lot of large enterprise companies who have not yet moved everything onto cloud-native workloads and have a lot of heterogeneous workloads, the challenge becomes: how do we create an identity? How do we enforce that identity? How do we ensure that this thing can go here and that thing can go there? It's a lot of operational burden, and there are teams that do it and do it well, and we're learning from them. What we're excited about is pulling the barrier down quite a way. The barrier would be: if you have a Linux kernel that can run eBPF, then you can run a rule set that will ensure that the right things are going to the right locations.
Robert Blumen 00:29:55 I'm going to change directions again; I want to move on to talking about testing, which is a big topic. Start with a developer who is integrating a new service. How do they go about testing it, either on their own workstation or in environments they have access to?
Tyler Flint 00:30:14 The common way is usually they'll go and get a test account, or some of the really good vendors will provide sandbox accounts that give them access to things, maybe virtualized. So they'll integrate that in, they'll run it in their workflow, and verify that things are working the way they expect. And then the primary operational mode for 90-plus percent of organizations is: okay, it works, let's go ahead and ship it. And then all of the challenges begin at that point. Once it ships, they start to realize: well, how do we run end-to-end tests in our CI system? And if we do run these end-to-end tests in our CI system, how can we ensure that only the locations that we intended to use are being accessed? And so one of the challenges that teams face is the hidden cost of transitive dependencies.
Tyler Flint 00:31:10 There are certain application ecosystems that are more well known for this, and not to pick on anyone here, but there are some that are very well known for having transitive dependencies. One of the big surprises is that you pull in a dependency and it works locally, then you go and run it in production and maybe it's not working in production, and you start to ask why, only to find out that the dependency has a dependency, and that dependency calls out for something and can't get it. For whatever reason, maybe the firewall policy doesn't allow it, the network doesn't allow it, and now it's not working, and there's troubleshooting of this dependency, trying to figure out what happened, all to find out that the dependency first had a dependency on going and grabbing something else. The idea is that hopefully we can help shine a light on some of those things, but right now the common practice seems to be: the developer gets it working locally, ships it, and then kind of figures out how things work over time.
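One way to make that kind of surprise egress loud in CI, sketched in Go: wrap the process's HTTP transport with an allowlist so any transitive dependency that phones home fails the test run with a clear error. The host name is a placeholder, and this only covers code paths that use Go's default HTTP transport.

```go
package testutil

import (
	"fmt"
	"net/http"
)

// AllowlistTransport rejects any outbound request whose host is not
// expected, surfacing transitive dependencies that call out during tests.
type AllowlistTransport struct {
	Base    http.RoundTripper
	Allowed map[string]bool
}

func (t AllowlistTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if !t.Allowed[req.URL.Hostname()] {
		return nil, fmt.Errorf("unexpected egress to %s during tests", req.URL.Host)
	}
	return t.Base.RoundTrip(req)
}

// Install swaps the allowlist in process-wide, e.g. from a TestMain.
func Install(allowed ...string) {
	m := make(map[string]bool, len(allowed))
	for _, h := range allowed {
		m[h] = true
	}
	http.DefaultTransport = AllowlistTransport{Base: http.DefaultTransport, Allowed: m}
}
```

Usage in a test suite would be a single call such as Install("sandbox.vendor.example"), after which any unexpected host shows up as a failed request with the destination named in the error.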
Robert Blumen 00:32:16 It’s usually easier to get access to the happy path. You can test that it works when everything’s good. Is it fair to say that often the error codes and what errors look like are less well documented or they don’t all appear in the testing you can do in a sandbox?
Tyler Flint 00:32:35 Absolutely. And I'll even add one other layer of pain. The problem arises because most organizations are not recording all the connections or requests; it's very expensive, especially at high scale. So what will end up happening is you'll have a user who is continuously reporting over and over to support: this isn't working, here's my screenshot. And the support team will look at that screenshot and say, yeah, it looks like it's not working. Then they'll go and create a ticket, and some project manager will prioritize it. A developer will look at that and say, well, how do I reproduce that? And then they have to go back to the user: well, I'm doing this, I'm doing that. And then they go and try to reproduce it. And so often these things just get categorized as cannot-reproduce and then sit there forever.
Tyler Flint 00:33:26 And so one of the things that we are really aware of is our ability to see the wire. We are on the wire, and in fact that's our core philosophy: we are the source of truth because we are on the wire, we've tapped into the wire, we can see all these interactions. With our pluggable system, we can have rule sets that look for errors or error conditions or things that are outside of the norm, and it's a lot more manageable to record the exceptions and store those. So then what happens is that when the support teams pass a ticket over the wall, it can come with things like a customer ID. The developer can go and match that up: oh, here was the request that went across the wire, let me go and look at the payload that was sent. Oh, that's why; it's completely clear. Then they can take that payload, dump it into their system, see the result, fix it, and they're on their way.
Robert Blumen 00:34:21 We’ve been talking about testing our code, which consumes the services. Should organizations adopt a posture of testing the service as well, writing test suites, load testing, error testing, whatever they can think of?
Tyler Flint 00:34:37 That is really fascinating. You know, I had not considered that. Yes, I would tend to agree with you. I think that’s something that should be considered.
Robert Blumen 00:34:48 So now that you're considering this, could you think of, from your experience, something that an organization might find by doing this kind of testing that they would otherwise only learn the hard way?
Tyler Flint 00:34:59 Yeah, one of the things that seems obvious is that API documentation tends to drift. You build an integration and, like you mentioned, you're running through the happy path, and you look at the docs: okay, when this scenario happens, then yeah, everything looks good, and we'll continue on our way. Then what ends up happening is that in production you are going to encounter that scenario, and unfortunately it's hard to hold vendors accountable. If you're fortunate enough to have vendors who listen, maybe they're startups and they're much more sensitive to things not working correctly, but for the most part vendors are what they are. And I can totally see what you're saying: if you're able to write a client and verify and run everything, then that would essentially ensure that your app has resilience.
Robert Blumen 00:35:58 Okay, moving on to the next big domino. You've mentioned a few times that organizations don't know how much of an API they're consuming, and that you have some tooling in your product that helps with that. Could you comment generally on monitoring and observability of external services? Whether somebody's using your product or not, how should they approach that?
Tyler Flint 00:36:24 Well, I'll tell you how it's currently approached and how we look at it differently. Currently, monitoring is primarily integrated into applications via SDKs, and there are some agents and monitoring solutions that will monitor the system itself, but primarily monitoring is done with SDKs. What we tend to find is that we'll come into an organization and there may be a handful of applications or teams that have done a really thorough integration of a particular SDK and have some pretty good observability, and others maybe not so much. And so, and I go back to this, the truth is on the wire. For us, the way of thinking about it is: the truth is on the wire, and gold is in the stream. It goes back to our philosophy that if we can tap into the connections and observe what's actually going across the wire and what's in these streams, and then cross-reference that with metadata from the system, whether that's process, network, etc., we're able to provide a definitive story of truth regardless of what your team has implemented.
Robert Blumen 00:37:43 For any standardized service that you run, or even services you get from your cloud service provider, which is a vendor, you can get a huge proliferation of different metrics and learn quite a lot about how it's running. If you have to implement it yourself, what are the metrics you should try to collect from your own usage of an external service?
Tyler Flint 00:38:10 Good question. Let's pull these into a couple of different categories. In the category of performance, you're primarily interested in latency: how long does it take for your application to get a response back? Within latency you want to look at two aspects: what's the impact of the network, as opposed to the time it takes for that particular vendor to respond? Then we move into uptime. For uptime, it's important to not just look at network availability, meaning a connection was opened and a connection was closed, but to actually look at the protocol level. For instance, HTTP has a lot of protocol-specific context that you can't really get from the network layer, so diving into that is really important for uptime. And then bandwidth. Bandwidth is really critical because there's so much cost attributed to bandwidth, especially your cloud cost. Being able to understand which vendors and which applications are consuming bandwidth, and what the size of these payloads is, matters because you are going to get a bandwidth bill, and being able to track that back to a vendor is important for your inventory and your financial accounting.
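A sketch of collecting those three categories yourself in Go, via a wrapped HTTP transport; the Sample shape and the Record sink are illustrative choices, not a standard API.

```go
package metrics

import (
	"net/http"
	"time"
)

// Sample captures, per outbound request, the categories discussed above:
// latency, protocol-level outcome, and payload sizes for bandwidth
// attribution. ContentLength may be -1 when the size is unknown.
type Sample struct {
	Vendor     string        // destination host
	Latency    time.Duration // request sent until response headers received
	Status     int           // HTTP status: a 5xx is an error even if TCP was fine
	ReqBytes   int64
	RespBytes  int64
	NetworkErr bool // connection-level failure, as opposed to an HTTP error
}

// MeteredTransport records a Sample for every outbound request.
type MeteredTransport struct {
	Base   http.RoundTripper
	Record func(Sample) // sink: logs, Prometheus, etc.
}

func (t *MeteredTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.Base.RoundTrip(req)

	s := Sample{Vendor: req.URL.Host, Latency: time.Since(start), ReqBytes: req.ContentLength}
	if err != nil {
		s.NetworkErr = true // uptime at the network layer
	} else {
		s.Status = resp.StatusCode // uptime at the protocol layer
		s.RespBytes = resp.ContentLength
	}
	t.Record(s)
	return resp, err
}
```

Separating network time from vendor response time, as Tyler suggests, would additionally need net/http/httptrace hooks to timestamp DNS, connect, TLS handshake, and the first response byte.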
Robert Blumen 00:39:34 You've mentioned a couple of times the sensitivity of different companies to the complete failure, or even a single failure, of a vendor API. Should companies monitor failure rates, and should they page someone or file an alert if the vendor is not performing adequately?
Tyler Flint 00:39:55 I think there are two parts to that. The first part: the answer is yes, regardless of which part we're talking about here. It's very, very important. The way that our world gets better is when customers hold vendors accountable, and the more customers that can be armed with real data and go back to a vendor and say, hey, we're not getting the level of service that we're paying for, the more likely that vendor is going to change. Being armed with real data is the key. That's one. But then I also think that teams kind of have to accept a certain level of: this is what it is, this is our vendor choice, and that's what we're using, so we should really know what we're working with. And if it turns out that that vendor has a consistent 3% error rate, then our application should be able to handle that and more in order to operate properly.
Robert Blumen 00:40:48 We've covered a lot of what can go wrong and, to some extent, how to fix it. What about fixing the process by which companies adopt these vendors, so that they don't just fix the issues that you discover in your audit and then, a year from now, have a hundred new vendors they didn't know about? What should the best practices look like for adoption?
Tyler Flint 00:41:11 Yeah, I have a really strong opinion on this one. I think what should happen is that you should have a foundational monitoring system set up, so that you can run a proof of concept or some sort of trial and have exactly the truth of what happened. You should be able to see the complete source of truth: for this vendor, in the 48 hours, 72 hours, or 90 days that we were running our test, we can see that the P99 availability is this and the P90 availability is this. That is going to save your team a lot of time, front-loaded, in understanding the resilience, protecting reputation, and just saving time debugging these things. The biggest mistake that I think we've heard over and over is companies that assume a level of excellence, that assume vendors all aspire to five-nines uptime, only to find out that that is a pipe dream.
Robert Blumen 00:42:13 What you're recommending, then, is: measure the vendor so you have some data, and then decide if you can live with it, good or bad.
Tyler Flint 00:42:21 Absolutely yes. Measure. And then you have the reality.
Robert Blumen 00:42:25 We've covered a lot of the more general issues. I want to ask about something I learned reading about your product: that you started out with a proxy-based design, and that did not work to the level you wanted, so you switched to eBPF. Before I ask the question, I'll mention we've done a decent amount of coverage of eBPF on the podcast, in Episode 619 most recently, but there are several others. Can you tell the story of why the proxy design did not work out, and what challenges or issues you ran into going to eBPF?
Tyler Flint 00:43:06 Oh yeah. I'll try to be brief on this; this was a lot of fun. Essentially, with a proxy there's a fundamental problem: if you try to use a proxy to solve the problem of clients connecting to vendors in the same way that you solve the problem of users connecting to your services, it is a long and painful road. The reason is that when your customers are connecting to your services, you can terminate SSL using your domain and the TLS certificates that you own; you can terminate and then do any sort of monitoring and observability that you want there. When you're connecting to vendors, you do not own that TLS certificate. The connections are end-to-end encrypted. The only way to get in the middle of that is to do a man-in-the-middle with a self-signed cert. When you introduce that into your ecosystem, first and foremost, you have security problems.
Tyler Flint 00:43:59 If that self-signed cert gets into the wrong hands, anybody who is on your network can see everything that is going across the wire. Now that you've introduced a man in the middle, you have a single point of failure, you have another bump in the line, any instrumentation that you want to implement is now part of that bump, and you add latency and performance issues. So we found very clearly, when building our technology and trying to take it to market, that the market said: no, we're not going to do that. And when we then looked at how to recover and how to really solve this problem: early, early on in my career I worked in the Linux kernel and the Solaris kernel, particularly in virtual networking. So I was really excited about what I was hearing about eBPF. It had been many years since I had worked in that capacity, but I wanted to really dive in and see what we could do, in particular to probe into the Linux internals where connections were being established, before encryption and after decryption.
Tyler Flint 00:45:10 And I was really interested in whether it would be possible for us, as these applications push their data through these SSL read and SSL write functions, to tap into that and see the unencrypted data before and after. Of course we have to be very careful that we're always operating only on that same host, because of data residency concerns: you never want to take data that was intended for one location, bring it over to another, and start to parse it. So we had to do that on the machine, inside the Linux kernel, where we didn't expose any new boundaries. And I will say that the only thing that was able to push our team through our eBPF solution and all of the challenges it presented was that, for as hard and challenging and difficult as it was, it was equally exhilarating and exciting.
Tyler Flint 00:46:09 And we could do things that we just couldn't do before. It was so incredible to be able to implement these low-level solutions and inject them right into the kernel using eBPF. It was extremely challenging to get up to speed with how all of that worked. There are so many different frameworks: BCC, libbpf, are we using C, are we using Rust, what about cilium/ebpf and Go, and all of these different tools, and we had to figure that out. It was extremely challenging, even for a team that was very familiar with how kernel development works and Linux internals. But now, coming out on the other side, I'm extremely excited to help others get into it. The ecosystem is starting to bloom, but there's so much that needs to be done, and it's exciting.
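For a flavor of the userspace side, here is a minimal sketch using the cilium/ebpf Go library mentioned above. The kernel-side program would be compiled separately from C; the object file name, the program and map names, and the libssl path are all assumptions for illustration, not qpoint's actual code.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load a BPF object compiled separately from C source.
	spec, err := ebpf.LoadCollectionSpec("ssl_tap.bpf.o")
	if err != nil {
		log.Fatal(err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach a uprobe to OpenSSL's SSL_write: its buffer argument is still
	// plaintext, because the probe fires before encryption happens.
	lib, err := link.OpenExecutable("/usr/lib/x86_64-linux-gnu/libssl.so.3")
	if err != nil {
		log.Fatal(err)
	}
	up, err := lib.Uprobe("SSL_write", coll.Programs["probe_ssl_write"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer up.Close()

	// The kernel-side program pushes captured data through a ring buffer;
	// read it here, on the same host, honoring the data residency point above.
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()
	for {
		rec, err := rd.Read()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("captured %d plaintext bytes", len(rec.RawSample))
	}
}
```

Go binaries are the harder case Tyler describes next: they do not link libssl, so equivalent probes have to target symbols inside each Go binary's own TLS implementation.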
Robert Blumen 00:47:03 Can you give one example of something you can extract or see with eBPF that was either really cool or surprising to you?
Tyler Flint 00:47:13 Yeah, so this is something that we ended up doing. One of the challenges that we were facing is that we needed to create a coherent story of a connection: this connection has this source IP, this source port, this destination IP, this destination port, and then we've got to connect that up to the process that it belongs to, and then we have got to track that with all of the process metadata. eBPF is still, I would say, very much in its infancy, and there are not hooks for everything; there are not well-defined hooks for all the things that you need. So to create a connection map, we needed the underlying file descriptor to be able to track a connection back to the process that it belonged to, and all of that.
Tyler Flint 00:48:01 What we ended up doing is writing hooks into kernel functions that would receive pointers to memory locations within the Linux kernel. We would store a pointer in a map, just hold onto it, and provide some sort of lookup for it. Then, when a connection was established, we were able to take the pointer location and map it with something like a file descriptor, I don't remember exactly what we had in common, to then go and look that up out of the map, grab that pointer location, and traverse it in a completely different part of the program. What that ultimately did was make it possible for us to take whatever exists in the Linux kernel and go get it. We just have to know which function in the kernel has a reference to that pointer; then we grab that pointer, store it in a map, and later, from all these different events, we can pull it back out and traverse it.
Tyler Flint 00:48:56 And so that was one of the things that was just really shocking. And here's the specific example. When we're trying to tap into these SSL-encrypted connections, getting to before TLS and after TLS, some of the applications use OpenSSL, which makes it easier, but some applications are built using Go, and Go, as an example, is very, very unique in the way that it builds: it bundles its own SSL library. And so we were having a hard time mapping up the connection that we were able to pull out of a Go application with the actual connection. We were able to use that technique to find the pointer and traverse it, get all the information that we needed, and then present it up into our userspace process.
Robert Blumen 00:49:46 I'm not sure I understood all of that, but I'll make an attempt. These pointers point to kernel data structures with all kinds of information, and you were able to map out where a bunch of different things are, so that enabled you to start from what you know and then grab all the associated data from the kernel that's useful.
Tyler Flint 00:50:10 Yeah. Another way to say that is: with the way that eBPF is written, you have hooks, and you can hook into certain pieces of the system, whether that's a function call or a system call or some sort of boundary. For the eBPF program that you write, you are given input that is very specific to that hook. The biggest challenge that we ran into was when you don't have all the information that you need in that hook. So essentially, the process that we underwent was to create other programs to tap into other things, take the pointers to the things that we needed, and store them in maps, so that when the other programs would fire, we were able to get that information and traverse those pointers. Once we got into that flow, what we could do was almost limitless.
Robert Blumen 00:50:57 That's very cool. We're pretty close to the end of our time. Before we wrap up, would you like to direct listeners anywhere on the internet, either for you or for qpoint?
Tyler Flint 00:51:08 So I don't have a great presence myself; I know that's something I have to work on. But qpoint is something that I'm very passionate about. The team has worked very hard, and we're really excited. So I would say go check out qpoint.io, Q-P-O-I-N-T dot io.
Robert Blumen 00:51:25 We will put that in the show notes. Tyler, thank you very much for speaking to Software Engineering Radio today.
Tyler Flint 00:51:31 Thanks for having me on. I really appreciate it, Robert. It's been great talking.
Robert Blumen 00:51:35 It’s been a pleasure. And this has been Robert Blumen for Software Engineering Radio.
[End of Audio]