Jamie Riedesel, author of the book Software Telemetry, discusses software telemetry, why telemetry data is so important, and the discipline of tracing, logging, and monitoring infrastructure. Host Gavin Henry spoke with Riedesel about what telemetry is, the different ways to scale it, the tools involved, security and privacy considerations, centralized logging, types of metrics, observability, distributed tracing, Security Information Event Management, GDPR, personally identifiable information, what to ship and what to store, payload types, payload sizes, telemetry challenges, cardinality, collectd, statsd, grafana, ELK stacks, SNMP, Syslog, Jaeger, LISA, telegraf, protobuf, filebeat, influxdb, cassandra, and what to think about when getting telemetry from mobile devices.
Show Notes
Related Links
- Episode 445: Thomas Graf on eBPF (extended Berkeley Packet Filter)
- Episode 428: Matt Lacey on Mobile App Usability
- Episode 409: Joe Kutner on the Twelve-Factor App
- Episode 361: Daniel Berg on Istio Service Mesh
- Episode 346: Stephan Ewen on Streaming Architecture
- Episode 337: Ben Sigelman on Distributed Tracing
- Episode 220: Jon Gifford on Logging and Logging Infrastructure
- Episode 56: Sensor Networks
- Improving Software Development Management through Software Project Telemetry
- Towards Dependability in Everyday Software Using Software Telemetry
- Vehicle Data Acquisition and Telemetry
- collectd
- statsd
- grafana
- Jaeger
- Telegraf
- Protobuf
- Filebeat
- Influxdb
- Cassandra
- ELK Stack
- OpenTelemetry
- Blog
- Book
Transcript
Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].
Gavin Henry 00:00:46 Welcome to Software Engineering Radio. I’m your host, Gavin Henry, and today my guest is Jamie Riedesel. She is a staff engineer at Dropbox, working on their HelloSign product. She has over 20 years of experience in the technical field, starting with office IT, moving to systems administration and engineering, and most recently in DevOps. She has been presenting at technical conferences since 2015, on topics such as Logstash optimization, monitoring system refactoring, and how to get over your workplace-induced traumas. Jamie, welcome to Software Engineering Radio. Is there anything I missed in your bio that you’d like to add?
Jamie Riedesel Those are the major points. Thank you.
Gavin Henry Excellent. I’d like to start, first of all, with an overview of what software telemetry is, its terms, and its history. So Jamie, what is software telemetry?
Jamie Riedesel At its broadest level, software telemetry is what we all use to figure out what our software is actually doing, quite frankly. I mean, when you’re writing software, how is it that you find out that it’s doing what it should? And that’s only a little rhetorical, because if it’s working exactly the way it should, great, but if it doesn’t, software telemetry tells you why. It’s those little print statements we scatter all over our code to give us feedback on what’s going on. It’s us instrumenting this black box that we’re writing to tell us how it’s working. That’s software telemetry at its broadest level. And frankly, we’ve been doing it for decades and decades. At the dawn of computing, software telemetry was an indicator light on the computer showing a little processing, and whether it was blinking fast or slow told you if it was doing the right job at the right speed. Later on, we got into paper, you know, it would actually print out things or pop something up on an operator console when something went wrong. And these days we have robust systems. Centralized logging has been around for a long time and is probably the first major software telemetry system, but we’ve also added on metrics, which is kind of an extension of the old operations monitoring concept, and tracing has been added on there too. And it’s all been rounded up in the last few years into the concept of the pillars of observability, which is all about making your systems observable. So software telemetry is the set of components of your observable systems.
Gavin Henry 00:03:12 Thank you. And if we could sort of boil that down to one sentence, why do we need software telemetry?
Jamie Riedesel 00:03:20 Software telemetry tells you what’s going on in the black box we’re building.
Gavin Henry 00:03:24 Okay. And how is it used once we know what’s going on in the black box we’re building?
Jamie Riedesel 00:03:30 Every individual developer does it a little differently, and that’s one of those fun concepts, because when you’re writing software on your own, on your own box, all by yourself, you can do whatever you want. However it is, you need to know what’s going on in there. That’s creating log files, or maybe you have a different stream, or a little bug on your website that pops up and gives you rendering times for things. All these little things are software telemetry. But the thing is, it changes at very large systems, or any time you get more than one developer involved. Probably a better way to put it is that a small team is more likely to use something like a software as a service platform, a New Relic or Datadog or Honeycomb type system, to figure out what’s going on across the team, and so uses software telemetry a little bit differently. And when you get to the truly big ones, you’re talking about those global data centers where software gets deployed to data center after data center as part of your canary process. You know, the Google-sized software telemetry at that scale is fairly different than it is when it’s just five people working on a thing, kind of as a hobby. And the thing is, it’s all the same thing in the end. What we want to learn about our code is the same, but how you approach it differs a little bit based on how many people you have to coordinate, the engineering discipline you’re all following, and the languages you’re using, because not everything is fully instrumented.
Gavin Henry 00:05:02 So when we’re thinking of the general term of software telemetry, what are the key terms we should keep in our head when we start moving into the next sections of the show, for example, things like cardinality.
Jamie Riedesel 00:05:14 Oh yes. There are four major types of telemetry that I talk about. Centralized logging has all your log outputs, the events, little text-based snippets saying what happened when. Metrics are all numbers-based. They’ll generally have a few little data tags to give you an idea of where the thing was when it was issued, like number of pages converted or bytes uploaded, that sort of thing. Traces are more of a distributed tracing concept, where it follows execution and traces timing, with some context in there. I also cover SIEM systems, the Security Information Event Management systems. This is what security teams use, and it gets a little overlooked when it comes to telemetry, simply because people forget the security team is there, but it’s also a kind of extension of centralized logging. So those are the four styles of telemetry that I talk about.
Jamie Riedesel 00:06:10 But there are others. Cardinality is a major concept when it comes to maintaining telemetry systems, especially the data stores behind them, because cardinality is one of the key constraints in the databases we use for this. The database you use for your centralized logging system is usually different than the time-series database you use for your metrics system, and how those two systems respond to cardinality is very different. Cardinality also means different things in different databases. We can talk a bit about that if you want to. I have an entire chapter on that.
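To make the cardinality concern concrete, here is a minimal sketch. It assumes the common time-series model in which each unique combination of tag values on a metric becomes its own series; the tag names and values are hypothetical and not taken from the interview.

```python
# Hypothetical tag values attached to a single metric name.
tags = {
    "datacenter": ["us-east", "eu-west", "ap-south"],
    "service": ["api", "worker"],
    "status": ["ok", "error"],
}


def series_cardinality(tag_values):
    """Upper bound on distinct series: the product of the value counts per tag."""
    total = 1
    for values in tag_values.values():
        total *= len(values)
    return total


print(series_cardinality(tags))  # 3 * 2 * 2

# Adding one unbounded tag, such as a user ID, multiplies the count.
tags_with_user = dict(tags, user_id=[f"user-{n}" for n in range(10_000)])
print(series_cardinality(tags_with_user))
```

Under these assumptions, a handful of bounded tags stays small, while a single per-user tag pushes the same metric into the hundreds of thousands of series, which is the kind of growth a metrics data store has to be sized for.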
Gavin Henry 00:06:45 You’ve done a nice intro to the next section. I’ll move us on to shortly. But if I could ask the high level, because we’ll dig deep into other topics shortly, how is software telemetry different from analytics or logging or monitoring?
Jamie Riedesel 00:07:01 The thing is, it’s all really the same thing. It’s just a label I use for that.
Gavin Henry 00:07:06 I just thought,
Jamie Riedesel 00:07:08 Yeah, honestly, monitoring is where I first ran into software telemetry. Early in my career, I came from the operations, sysadmin side, and we called metrics monitoring. I mean, I’ve given talks on how to improve your monitoring system and how to only alarm on the right things. That’s a metrics system being used in a specific way. Honestly, metrics is in a lot of ways an extension of monitoring, because the software engineering teams said to the operations teams, “You’ve got this cool system, can we get in on that?” And we started seeing this convergence of purpose. Analytics is also another way of describing these concepts. It’s a form of metrics, and it’s also a part of tracing. You also see application performance management thrown around a lot. That is another concept that merges some of the metrics and tracing concepts in a lot of ways, but it’s still software telemetry.
Gavin Henry 00:08:04 So would you say software telemetry is like the foundation term, and then it gets scooped up into analytics or logging or monitoring based on what telemetry data is coming in? Is that a good way to think about it?
Jamie Riedesel 00:08:16 That is a good way to think about it, yes. The term telemetry for this is fairly new. Quite frankly, the most interesting project going on right now in the telemetry space is the OpenTelemetry project from the Cloud Native Computing Foundation, which is trying to build a transmission protocol for traces, for the distributed tracing product. And they have this idea that they can extend it to do things like logging and metrics along the same protocol once they get it done. It’s still being built right now, so we’ll see what they actually come out with. But telemetry seems to be a new term that’s encompassing a lot of these older, been-around-for-ages terms. This industry likes to eat its dead, and this is one way it’s doing it.
Gavin Henry 00:08:59 I like that term. Yeah, I suppose they’re moving towards trying to standardize something, aren’t they? Just briefly on this one, because I know we’re both passionate about the subject and could talk a lot longer than an hour: how did it evolve in relation to developments in networking and hardware to where we are today? You’ve got a lovely chapter on this in your book, which we haven’t mentioned yet, which is called Software Telemetry, published by Manning. How did we get to where we are today with the developments of the underlying network and hardware, do you think?
Jamie Riedesel 00:09:32 Well, this is a fun story, because software telemetry, the developer feedback side of it, took a while to emerge. We first started seeing some signs of it in the early 1970s, but that was a very limited era of computing. Printers were probably our biggest avenue for that, but at the same time display technologies were also advancing. What I think of as one of the key turning points for software telemetry, and telemetry usage in general, was syslog, which came out of Sendmail, the 3BSD thing, I don’t know, 1980. I can’t find the exact genealogy of when it started, but syslog came out of the Sendmail project, and other systems just said, “Hey, we’ve got this logger here, can we just use that?” So other projects picked it up and started sending things to this email logger, which at the time was one of the biggest log-producing pieces of software on various systems, you know, the email handler.
Jamie Riedesel 00:10:29 So that was where we sat for a long time, and syslog moved forward until we got to about 2001, 2002, somewhere in there, when a group of people decided to standardize this thing, because it hadn’t been standardized yet. So they created a group of RFCs to standardize the syslog protocol. Standards are good things, because it means it’s stable, so you know what it is and how it works. And that gave so many things an ability to predict how to emit things. Now, rewinding the clock a little bit, we have to look at the parallel evolution of network stuff. Those of you who have been in the networking world know that a lot of it was running on serial before it was running on Ethernet or anything else. The networking of the 1970s and earlier was all RS-232, very serial, and the networking equipment that came in the eighties, nineties, and early 2000s all used serial connections to configure. But at the same time, they also started working with network-based methods of communicating, and most networking hardware used something called the Simple Network Management Protocol, a UDP-based thing, to talk with the centralized systems for configuration and reporting events. You know, a telemetry system, quite frankly.
Jamie Riedesel 00:11:47 So that was networking’s parallel evolution. Now, fast forward to a bit past 2000, and we’re finally seeing some real convergence between the software space and the hardware space. Syslog continued to be used on the software side, but a lot of hardware makers realized, “Hey, this is a standard we can actually safely use.” So over the first decade of the century, we saw a lot of hardware makers start adding syslog support to their systems, just to ease integration with the rest of the infrastructure people were running. Meanwhile, we had software doing its own dang thing. To rewind back to the nineties, centralized logging was the first real telemetry system we had. Back in those days, you had a few servers, they spat out syslog files, and you copied the syslog files across an NFS share to a shared directory, where you then used tools like grep to go looking for whatever it is you’re looking for.
Jamie Riedesel 00:12:42 It was very simple back in those days, not much in the way of databases, although a few did actually do things like fork their web logs into relational databases. That model lasted for a while, and it gave rise to sophisticated dashboards that we would recognize, quite frankly, as analytics: analyzing traffic flow and raising the view of errors. It was a very focused type of telemetry in those days, but it was powerful, and it kind of proved what would come later. Fast forward to about 2010, and we started seeing the convergence of the old monitoring systems into the metrics systems. We started seeing products like StatsD, which Etsy released, which is a small UDP-based daemon that could take statistics information coming from software as a fire hose and just send that into whatever your backend was.
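As a rough illustration of the fire-hose style described here, the sketch below formats metrics in the plain-text StatsD line format and sends each one as a single UDP datagram. The metric names are hypothetical, and the host and port are assumptions (8125 is the conventional StatsD port); this is not code from the interview or the book.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire and forget: send one UDP datagram and expect no reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, for illustration only.
counter = statsd_line("pages.converted", 1, "c")   # counter: it happened once
timer = statsd_line("upload.duration", 320, "ms")  # timer: it took 320 ms

print(counter)
print(timer)
```

Calling `send_metric(counter)` would ship the line to a daemon listening on the assumed address. Because UDP has no acknowledgment, the application does not wait on the telemetry backend, which is a large part of why this approach adds so little load to the emitting code.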
Jamie Riedesel 00:13:38 And that was valuable, and actually a good point for the product. So we started seeing the metrics there, but at the same time, the hardware stuff was still emitting into the centralized logging systems, and the monitoring systems that operations teams were using were also still monitoring the network hardware. You know, how much traffic are you doing on a given network port? That’s still a form of metrics. It’s just not what software engineers usually think of as metrics; that’s, you know, monitoring. So these days in the modern infrastructure, here we are in 2021, you’ve got a converged infrastructure in a lot of ways. The networking team may have their own, but you have the possibility of a single interface, probably Grafana these days, that can get you network port information. It can also give you logs. It can also give you execution rates for given functions in a given data center or given rack.
Jamie Riedesel 00:14:31 You can split things down that far. It’s come a long way. One of the most interesting evolutions I’ve seen was around 2015. The big inflection point is when distributed tracing really started getting a lot of mind share, because this was the idea of using cardinality metrics to trace execution across a distributed infrastructure, which is a problem that microservices infrastructures had: you’ve got 50 to 100 different services doing their own thing, so how do you follow execution across all that? And that’s how that was solved. Confusingly, those early systems were called observability systems, but we’ve since dropped that term in favor of the pillars of observability. So yeah, there’s been a lot of movement, and quite frankly, over the next 10 years I fully expect something new to come along that I haven’t seen. So my goal is to get people thinking about the generalities, so when the new stuff shows up, you’re able to work with it effectively and implement it.
Gavin Henry 00:15:29 Thank you, Jamie. That was a great history lesson. You really do know your stuff. We’ve done a previous show, Episode 220 with Jon Gifford, on logging and logging infrastructure, which was a good lesson, and Episode 337 with Ben Sigelman on distributed tracing. I’ll put some others in the show notes. So just before I move on to some of the topics you’ve already mentioned, and I think you might have answered this one already: what does the simplest form of telemetry look like? I think I can guess the answer.
Jamie Riedesel 00:16:01 You’re probably not wrong. It’s a simple log file. It’s the easiest. Every programming language I know of has some way to send text into a file.
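For a sense of how little is needed at this simplest level, here is a minimal sketch using Python’s standard logging module to write timestamped text events to a file. The file name, logger name, and messages are hypothetical.

```python
import logging
import os
import tempfile

# Hypothetical log location, chosen so the example runs anywhere.
log_path = os.path.join(tempfile.gettempdir(), "app_telemetry_example.log")

logger = logging.getLogger("example_app")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Each call appends one line of text to the file.
logger.info("converted document pages=%d", 12)
logger.error("upload failed reason=%s", "timeout")
handler.flush()
```

On a single machine, reading that file with whatever tools are at hand is the whole telemetry system. Centralizing only becomes a concern once several machines each produce a file like this.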
Gavin Henry 00:16:10 And then the telemetry bit would be getting that remote file to somewhere central? Or is that just after the fact?
Jamie Riedesel 00:16:18 You know, if it’s truly a single-system box, your telemetry may be simply looking at the file with whatever tools you have available. But if you’ve got a multi-machine infrastructure, then you want to start centralizing it. That gets you some more interesting infrastructure.
Gavin Henry 00:16:31 And what’s the most complex you’ve seen?
Jamie Riedesel 00:16:33 Complex, yes. I’ve heard of some very interesting complex ones. Complex comes up when you get these sort of globe-spanning SaaS products where you do follow-the-sun deploys and your canary deploys involve millions of users, because a million is actually small for you. In those circumstances, the telemetry systems tend to be very full of aggregation, summarization, and sampling, simply because you can’t catch the entire flood of it all. So in these situations, data centers have their own pools of high-resolution telemetry. They get summarized up the tree into higher regional or maybe global tiers, where the apex thing that builds the dashboards on the wall at headquarters is showing you highly summarized information for how everything’s working. Those systems are incredibly complex, and they’re the kind you get when you have an entire department of engineers whose job it is to make this stuff work. You don’t see those very often, except at the biggest companies.
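The idea of summarizing up the tree can be sketched as follows. This is a minimal illustration, assuming each data center keeps raw samples locally and ships only a small summary record to the next tier; the data center names and latency values are hypothetical.

```python
from statistics import mean


def summarize(samples):
    """Collapse high-resolution samples into one small summary record."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
    }


def roll_up(summaries):
    """Merge lower-tier summaries into a higher-tier one without raw samples."""
    total = sum(s["count"] for s in summaries)
    return {
        "count": total,
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
        # Weight each mean by its sample count so it matches the raw mean.
        "mean": sum(s["mean"] * s["count"] for s in summaries) / total,
    }


# Hypothetical request latencies in milliseconds, per data center.
datacenters = {
    "dc-a": [10, 20, 30],
    "dc-b": [40, 60],
}

regional = roll_up([summarize(samples) for samples in datacenters.values()])
print(regional)
```

The trade-off in this design is that the higher tier can no longer answer questions the summary did not anticipate, such as percentiles, so the high-resolution data has to stay queryable at the lower tier for as long as it is retained.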
Gavin Henry 00:17:41 Perfect. You can only keep those for so long as well, so it depends on your time history, right? Perfect. Let’s move on to the next section. So we’re going to talk about the types of software telemetry, which you’ve touched on, drill down into some system components, and round off the show talking about telemetry data in resource-constrained, personal environments. So now that we’ve covered off the basics and history, I’d like to explore the main types of telemetry discussed in your book, which is called Software Telemetry, published by Manning, which I’ve already mentioned. The first point I’ve got is centralized logging. Can you give me an example of that, please?
Jamie Riedesel 00:18:22 Centralized logging: probably the most famous example people have heard of is the ELK Stack, which is Elasticsearch, Logstash, and Kibana, which elastic.co made famous over the course of the 2010s. It wasn’t all Elastic at first. Logstash and Kibana both were outside projects that were later acquired. But the idea is that you have all of your logs flowing to a central place where you have a graphical UI of some kind to search, create dashboards, and bisect, dissect, whatever you need to do to dive into that information. The value of centralized logging is that presentation. It’s all in one place, and you can search all of it there. That gives you the best chance to learn something from your pile of logs. The old-school method of just putting a bunch of files on an NFS share somewhere, that you had to mainly cut through with text-based tools, may be more purely Unix, but it’s a terrible user experience, and most people generally didn’t bother.
Gavin Henry 00:19:27 Thank you. And of course they’ve caused a bit of upheaval with their license change this year.
Jamie Riedesel 00:19:33 That’s a whole thing.
Gavin Henry 00:19:36 We won’t talk about that, but yeah. So the next one would be the term metrics. If you could give me an example of that, please.
Jamie Riedesel 00:19:45 Metrics is all about sending numbers. In the evolution of telemetry, centralized logging was the first one, so all text-based. Metrics is the next one, as people realized that if you can reduce your telemetry down to numbers, numbers store very conveniently and compactly, and you can keep a lot more of it online, as had been proven by the monitoring systems operations had been using. Where with centralized logging you rarely get more than a month or two online, with metrics, because of how they compress and summarize better, you can get whole years of this stuff. And once you have that kind of long baseline, you can learn so many things about how your production systems operate. So yeah, that’s what metrics systems are all about.
Gavin Henry 00:20:28 So that would be more like how quickly something’s happened, how many things have been used, load, all that type of stuff that’s particular to your application or sector?
Jamie Riedesel 00:20:40 Counters and timers are the two most popular types: how often a thing happens, and how long did it take.
Gavin Henry 00:20:45 You’ve mentioned observability a few times, but what is this observability?
Jamie Riedesel 00:20:52 Observability is an umbrella term for logs, metrics, and traces. It’s been an interesting ride, because, like I mentioned, in 2015, 2016 or thereabouts, it was actually bandied about as a separate telemetry style. But people realized that no, actually, observability is kind of a methodology rather than a specific style of telemetry. It’s about integrating these three pillars, the logs, the metrics, the traces, into something that can work together and reinforce each other to give you the most visibility into what your production systems are doing.
Gavin Henry 00:21:25 Did that come out of the marketing department or the engineering department?
Jamie Riedesel 00:21:30 Ha, that’s a little bit of the marketing, I believe. Honeycomb.io is probably the most famous vendor in the observability space. They were probably the first to try to market on the term, but the word itself has gone on to become more of a discipline as opposed to a specific style.
Gavin Henry 00:21:48 I suppose like DevOps or something, you know?
Jamie Riedesel 00:21:50 Yeah. As someone who has DevOps in her job title, that is something that has a lot of opinions behind it.
Gavin Henry 00:21:58 Cool. You mentioned this as well: distributed tracing. Could you give me an example of that? And I’ve also got in my notes here the OpenTracing project, opentracing.io.
Jamie Riedesel 00:22:10 Yeah, OpenTracing. OpenTelemetry is kind of the successor to that one, I believe. But the idea behind distributed tracing is to, again, embrace the graphical power of GUIs. When you’re writing a program and you enter a function, one of the things you do, if you’re using a tracing system, is use the SDK to mark, “Okay, I’m going to trace this part.” You do some work, then you close the trace and it reports to the back end. You just instrument your code this way and trust in the distributed tracing automation to help you. On the other end, when you’re looking back to see what happened, with all that code markup you’ve done, you can use the traces that come out of that to reassemble what happened.
Jamie Riedesel 00:22:53 What you get is functionally a stack trace for your distributed system, where each function was waiting for output and you have what called what in there. It’s a complete call stack. And if that happens to involve systems that run on different platforms, or even different continents, you can still have it assembled on the same interface, which gives you so much intuitive power when you’re looking at it, because it surfaces so many things. An example I use: if an upstream process has a fault but can still report a code, a downstream process can return an exception on that. In the centralized logging system you just see the exception, but none of the context that came before it. In a distributed tracing system, you’d be able to look at that interface and see that a previous system threw some weird stuff. It’s very visible, and you can look at that earlier execution and see what it did very quickly. With a centralized logging system, you’d have to have some fluency with it to know how execution usually flows and walk back the execution to where those logs are kept, whereas distributed tracing surfaces those weirdnesses more easily.
Gavin Henry 00:24:03 And what did you mean by it’s still producing code?
Jamie Riedesel 00:24:05 You’re instrumenting code. All right.
Gavin Henry 00:24:08 Okay. Yeah, it’s still spitting out something. So with distributed tracing, you could imagine it like Google Maps: you could have a high-level view that might be the interactions between microservices or functions as a service, for example, and then you could zoom in and get the tracing within the same function or series of method calls or classes. Is that the idea? I think that’s probably cool.
Jamie Riedesel 00:24:31 Yeah, it’s very powerful stuff. If you haven’t used it before, it can really blow your mind when you finally see it working on a product that you know. It’s like, wow, this is all put together in one spot. In the system I use at work, we have a lot of the basics for some of the distributed tracing concepts, but we don’t have it wired together quite yet. The thing is, you can get the same kind of detail from centralized logging. You just have to be far more fluent in what your logging looks like and how your operational flow works to get that sort of detail. Distributed tracing, when done correctly, means someone needs far less training, prior history, and intuition to be able to find the good stuff. It’s powerful stuff.
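The pattern described in this part of the conversation, open a trace on entering a function, do the work, close it, and later reassemble the reports into a call tree, can be sketched in a few lines. This is a toy, single-process illustration and not the API of OpenTracing, OpenTelemetry, or any real SDK; every name in it is hypothetical. In a real distributed system, the trace ID and parent span ID would be passed between services, for example in request headers.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stands in for the tracing back end
_active = []         # stack of spans currently open in this process


@contextmanager
def span(name):
    """Open a span, run the wrapped work, then close and report the span."""
    parent = _active[-1] if _active else None
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.perf_counter(),
    }
    _active.append(record)
    try:
        yield record
    finally:
        record["duration"] = time.perf_counter() - record["start"]
        _active.pop()
        finished_spans.append(record)


def render(spans, parent_id=None, depth=0):
    """Reassemble reported spans into an indented call tree."""
    lines = []
    for s in sorted(spans, key=lambda item: item["start"]):
        if s["parent_id"] == parent_id:
            lines.append("  " * depth + s["name"])
            lines.extend(render(spans, s["span_id"], depth + 1))
    return lines


# Hypothetical instrumented code path.
with span("handle_upload"):
    with span("convert_pages"):
        pass
    with span("store_result"):
        pass

print("\n".join(render(finished_spans)))
```

The spans are reported in the order they finish, children before parents, yet the parent links are enough to rebuild the tree afterwards. That reconstruction step is what lets a tracing interface show the earlier, upstream execution next to the exception it eventually caused.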
Gavin Henry 00:25:17 Thinking about your example for centralized logging, I had a situation where an error appeared in the central logging, which happened to be an ELK Stack. There was a clear error, “couldn’t paste this, couldn’t do this,” and because I put together the system, I knew what happened before that. But if one of my colleagues were looking, a distributed tracing system would be perfect, because then you can actually see what it was trying to do before that, what time of day it was, all those different stages. So I totally get your example there. Thanks. Great. The last one in this section would be what you’ve mentioned already, SIEM, Security Information Event Management. Could you explain that and give me an example?
Jamie Riedesel 00:26:00 This is one that’s kind of a black box to most software engineers, because they don’t work with it, quite frankly. This is a specialized form of centralized logging.
Gavin Henry 00:26:09 No software engineers think of security.
Jamie Riedesel 00:26:12 Yeah. I know some security people have tried to do software, but the thing is, these systems are very specialized for the security mission. That falls directly out of regulatory frameworks that require a certain level of sophistication in your ability to trace who did what: what did they do, when did they do it, and how did they do it? For these regulatory frameworks, you need to be able to reconstruct what an operator did. They logged into this system and used sudo to run these five commands, or someone logged into an admin portal that had these nine things, four of which they did, and you need to be able to build those traces. SIEM systems are functionally very similar to centralized logging, but they’re also using very tightly focused telemetry signals. These are things like audit logs coming out of your Windows and Linux systems.
Jamie Riedesel 00:27:04 These are command logs coming out of your software as a service providers, which provide audit logs and such. They’re all unified together in a way that allows security people to trace what happened and use this to set alarms for things that happen. An example for us is that if someone logs into the root account of our Amazon account, security gets told, and that is a SIEM function. That’s because logging into the root account is something we don’t do routinely, so every time it happens, someone needs to be told, just in case someone’s going to do some bad stuff. So SIEM systems are kind of like centralized logging, but because of the security function, they need to keep data online way longer, so they have to be highly selective in what they put in there.
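The root-login alarm mentioned above amounts to a rule evaluated over a stream of audit events. A minimal sketch follows, assuming a simplified event shape; the field names, account names, and actions are hypothetical and do not reflect any particular cloud provider’s audit log format.

```python
# Hypothetical, simplified audit events; real audit logs carry more fields.
audit_events = [
    {"time": "2021-03-01T09:15:00Z", "account": "deploy-bot", "action": "login"},
    {"time": "2021-03-01T09:20:00Z", "account": "root", "action": "login"},
    {"time": "2021-03-01T09:25:00Z", "account": "root", "action": "list_storage"},
]


def root_login_alerts(events):
    """Return one alert message for every login to the root account."""
    return [
        f"ALERT root login at {event['time']}"
        for event in events
        if event["account"] == "root" and event["action"] == "login"
    ]


alerts = root_login_alerts(audit_events)
for alert in alerts:
    print(alert)
```

Because the rule only needs a few fields from a narrow set of events, a system built around rules like this can afford to keep its data for a long time, which fits the point about being highly selective in what goes in.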
Gavin Henry 00:27:49 Perfect. And I suppose that depends on, again, the sector of the product. Like the distributed tracing: is that log instantly shipped across the network, or saved locally? So there’s a lot to think about there, isn’t there?
Jamie Riedesel 00:28:04 Quite a lot, yeah. And that leads to some very interesting discussions you can have internally when you’re looking at this stuff, because different industry sectors have different integrity requirements for their tracing, their events, and their telemetry. You can definitely do things like create an end-to-end digitally signed telemetry flow. You can do that. You have high integrity, and you know something’s been tampered with pretty much immediately, because you stop getting traces. But the trade-off there is fragility, and that’s a business risk decision each organization needs to make for itself. Different markets have different requirements. Quite frankly, finance and healthcare have very strict requirements, where selling lawnmowers perhaps isn’t as strict.
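One way to picture a tamper-evident telemetry flow, and the fragility that comes with it, is a chain in which each event’s authentication tag also covers the previous tag. The sketch below uses an HMAC with a shared key as a simpler stand-in for true digital signatures; the key, event fields, and function names are all hypothetical, and this is not presented as the approach from the book.

```python
import hashlib
import hmac
import json

# Assumption: a shared key; in practice it would come from a secret store.
SECRET = b"example-shared-key"


def sign_event(event, previous_signature):
    """Tag an event with an HMAC that also covers the previous event's tag."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    message = previous_signature.encode("ascii") + payload
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return {"event": event, "signature": signature}


def verify_chain(signed_events):
    """Check every link; one missing or altered event fails verification."""
    previous = ""
    for item in signed_events:
        expected = sign_event(item["event"], previous)["signature"]
        if not hmac.compare_digest(expected, item["signature"]):
            return False
        previous = item["signature"]
    return True


# Hypothetical events emitted in order.
chain = []
previous = ""
for event in [{"seq": 1, "msg": "login"}, {"seq": 2, "msg": "sudo"}, {"seq": 3, "msg": "logout"}]:
    signed = sign_event(event, previous)
    chain.append(signed)
    previous = signed["signature"]

tampered = chain[:1] + chain[2:]  # the middle event has been dropped

print(verify_chain(chain))
print(verify_chain(tampered))
```

The same property that exposes tampering also makes the flow fragile: an event lost to an ordinary network fault breaks the chain just as a malicious deletion would, which is the trade-off each organization has to weigh.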
Gavin Henry 00:29:25 So I’d like to talk about example telemetry system components now, and discuss breaking down a complete telemetry system. What is the best way to do this, actually? Should we pick a sector that’s clearly quite simple to break apart? Or should we pick a web app, for example? That’s probably easiest.
Jamie Riedesel 00:29:44 A web app is probably the easiest for people. And I can kind of walk you through the growth pattern from small to global if we need.
Gavin Henry 00:29:53 That’s perfect. So if we take a web application, so a front end in a browser and a database on a couple of machines, shall we say, what would the key components of a telemetry system be in relation to that?
Jamie Riedesel 00:30:08 For something like an early small system, where you’ve got maybe five people working on it, quite frankly, at that stage, almost all of it’s going to be software as a service.
Gavin Henry 00:30:17 So just pick something online and I can go for it.
Jamie Riedesel 00:30:20 Yeah. Your JavaScript application is probably going to someplace like Sentry. And your backend code may be going into New Relic or Datadog or something like that. You’re not going to be spending any time building telemetry products. You’d be marking up your code for that, but you’re not going to be doing things like picking data stores and figuring out how to transform data along the way. You’re just not big enough yet. That stuff happens when you get more talent, and when the engineering organization gets big enough that you have the skill to be able to maintain your own internal. Yeah.
Gavin Henry 00:30:52 At this scale, how do we choose what to store and what to ship?
Jamie Riedesel 00:30:57 At the smaller scale, probably the biggest constraint is cost. Each software as a service vendor I know of sets their pricing based on ingestion rate in some way, size in some cases. Let’s put it a different way: engineers will instrument literally every line of code if they can get away with it, just so they can fully trace it and get that sort of virtual debugging experience, you know, set debug mode and step through the code one line at a time if you need to. That is the worst case, when you don’t trust your code one bit. The best case is you get a single event when everything completes and you’re all done. Every environment is somewhere in between. Figuring out what to send has to do with balancing the cost of ingestion versus the cost to your application of actually emitting that stuff. Because every time you emit an event or a piece of telemetry, it’s a little bit of parasitic load taken away from your production application. So it’s a big balancing act.
Gavin Henry 00:31:54 So yeah, my next question was, are these the same, I guess? If you’re using a software as a service system, the things that you ship could be the most important things that you need to find quickly and urgently, and the things that you store could be more longer term. Is that a good way to think of it, possibly?
Jamie Riedesel 00:32:12 With the software as a service provider method, you’d send real-time events to them, because you probably want to be building dashboards somewhere and may even have alarms set up. But longer term, that would be throwing a bunch of events into something like an S3 bucket, where you’re just sort of throwing it to the future and trusting that your future people know how to parse this stuff. And if you need to dig through it, someone may have some scripts to bisect and resect and reassemble if you need to. But it’s very much an ad hoc thing; that cold storage is not built out at the smaller stage. When you get to the medium and larger stage, your engineering org is bigger. You probably have some more DevOps or SRE type people around. And that means you may be able to handle self hosting.
Jamie Riedesel 00:32:52 Some of this stuff, you know, work about work, if you will. At that point you can start having the discussions of, do we want to bring stuff in-house from the software as a service systems because we figure we can maintain it better, easier, cheaper, for whatever reason? That threshold is different for every organization. Some places stay with software as a service, as a strategic move, all the way up to the global dominance stage. The kind where, when the sales representatives are in town, they take you out to dinner every time because you are such good customers.
Gavin Henry 00:33:21 Hmm. Yeah. I guess that’s a business decision, isn’t it?
Jamie Riedesel 00:33:25 That’s what it would be at that size.
Gavin Henry 00:33:28 In this example, with the web application, where should telemetry data get emitted from and saved to? If we’re using SaaS, we just save it on the service, but where should we,
Jamie Riedesel 00:33:39 Yeah. For SaaS, for the smaller or medium size web application, it’s almost always going to be a web API of some kind, because the native language of web applications is HTTP APIs, and the SaaS vendors have ways to do that. You just ship it through there. It gets interesting when you get into restricted environments, like governmental stuff, where you have to work everything through a proxy and you can’t talk directly to the internet. You get some fun engineering in those sorts of circumstances.
Gavin Henry 00:34:11 Should it be shipped in real time or in batch, or is that a decision you need to make?
Jamie Riedesel 00:34:17 That’s a decision you need to make, but keep in mind that the longer telemetry sits on whatever produced it, that’s state you’re keeping, and that state can be modified by an attacker. So if you’re trying to build telemetry that is defensible against an attacker, you want to move telemetry as fast as possible off of whatever produced it. So that is a design goal you should be aiming towards: evacuate that stuff as fast as you can. If you’re in containers or functions as a service, you can’t trust that local state anyway, so you’ve got to evacuate it just because of your development model. But if you’re on actual servers, you still want to do this, because if an attacker gets on the box, the first thing they’re going to try to do is modify the traces that show they were there.
Gavin Henry 00:34:59 Okay. Yeah. Should this be always on? Which I think you’ve just said, or should we trigger it on demand?
Jamie Riedesel 00:35:05 Very much always on. I know some organizations, especially if they’re on actual servers, have a once-an-hour cron job or a Windows scheduled task just to copy a log someplace else, and then they’re batch ingested at that point. I used to see this model more often in companies that have things like branch offices or retail, where they have a computer closet somewhere, before they all went online. You’d get these sort of hourly or once-a-day batch uploads of telemetry from these areas into the central system. But with networking the way it is these days, we’re seeing that model less and less.
Gavin Henry 00:35:42 Yeah. I mean, you just know someone’s going to hop in before the data is shipped and you can’t do anything about that. So what happens if it all stops working? Do we care? Do we need to monitor our telemetry?
Jamie Riedesel 00:35:54 Very much you do. I have found, as someone who manages this stuff, that when the logs stop flowing, I get called on Slack: “Hey, logs aren’t showing up, is something wrong?” I mean, people will come at you as soon as they can, because software engineers love that feedback of what’s going on in production. They’re responding to bugs coming from support, and to do that they have to search your telemetry systems, all three kinds. And if those aren’t there, their job has stopped, and that’s a blocked workflow you need to fix. So yeah, this is a system you do need to maintain visibility for. If you’re doing service level agreements and service level objectives, your telemetry systems need to be subjected to those just as well, because a large number of people in your organization rely on this to do their job effectively. And if it’s gone, you get the word roadblocked, roadblocked, roadblocked all over the place. And no one likes being blocked, especially in agile organizations.
Gavin Henry 00:36:49 And do you ever find yourself just going back to these guys: “Are you actually using the logs now? Why do I need to be bothered?”
Jamie Riedesel 00:36:57 There’s that temptation, but I’ve also been using them myself, because here’s one of the secrets about ops type roles: we’re kind of the backstop for “I need some help troubleshooting some things.” So we get tagged into a lot of weird and different things. And if we stick around for a while, and I’ve been at this job for just over five years, so I’ve seen some weird things, we just sort of build that intuition about what normal looks like, what abnormal looks like, and the various ways that things fail. So yeah, I’ve been in those systems too, and if they’re not there, I feel cut off as well. So yeah, they’re critical systems that we keep up, and we need to have reliability guarantees too, simply because they are decision support systems. These telemetry systems at their core are what people who maintain software systems use to determine if they’re doing the right job.
Gavin Henry 00:37:48 Yeah. I suppose it’s like driving a car without any dashboard. You know, you need that constant feedback. At this scale, what open source tools do you recommend that someone could start with to get this data into the SaaS providers?
Jamie Riedesel 00:38:04 Every SaaS provider provides an SDK of some kind for the various languages. I definitely recommend going there. But when you get to the larger sizes, you get to interesting discussions about data stores. And I’ve talked about Elasticsearch already. It’s a famous product for centralized logging, but another component, Jaeger, from the Cloud Native Computing Foundation, is a tracing product that can also use Elasticsearch as a data store. So Elasticsearch can just write that down.
Gavin Henry 00:38:36 Jaeger, and elegant.
Jamie Riedesel 00:38:38 Those are two ways to use Elasticsearch, for tracing and for logging. And very large systems, you know, the kind that use multiple data centers, use Apache Cassandra as a large scalable data store for this stuff. I’ve seen that used for metrics, and I’ve heard of it being used for logging. Jaeger can also use Cassandra for tracing, although I believe they prefer you to use Elasticsearch, because Elasticsearch does some of the indexing so they don’t have to.
Gavin Henry 00:39:09 Great. So, I think these will all be this type, but should these be push based or pull based, or am I thinking about monitoring?
Jamie Riedesel 00:39:19 You’re talking about monitoring, and that is a great question. Thank you for asking it. Looking at it from the metrics point of view, there’s push based monitoring and pull based monitoring. Pull based is where you have a central system that just polls a bunch of things and asks, “Give me your status, give me your status, give me your status,” on a schedule. Push based is when the system that produces the whatever tells a central thing, “I did a thing, I did a thing, I did a thing.” They are two different ways to get at the same kind of data, but they give you different shapes of data. For things like system monitoring, I mentioned before about network switches and their performance; that’s almost always done via pull based monitoring. Once every minute, two minutes, or however long your normal is, it polls everything to see how much work it’s done, and it reports that into a centralized database. You see that far more in infrastructure monitoring, like network hardware, the servers themselves, the server hardware, power units, stuff like that. You see that as pull based monitoring most of the time.
Gavin Henry 00:40:17 And why do you think that is?
Jamie Riedesel 00:40:20 Because a lot of these things don’t have an ability to push, or if they do it’s through SNMP, which is kind of icky to use outside of networking. So it’s a lot easier to use something like collectd or Telegraf to be able to do the polling.
Gavin Henry 00:40:34 Just give us some ideas here of what a telemetry payload should look like in these types of small situations.
Jamie Riedesel 00:40:41 The payload, especially for the small situations, is whatever the SDK is building, which is almost always JSON, quite frankly. But when you get again to the medium and larger sizes, when you’re doing some centralization, it really does depend on your specific organization. JSON serializes very fast. We spent the last decade making that extremely fast everywhere we can get ahold of it. So JSON’s fast, but I’m also seeing some more usage of protobufs, protocol buffers, for some of this stuff. Protocol buffers are a bit controversial because they are a very static format.
Gavin Henry 00:41:14 That’s not what they’re supposed to be used for, is it?
Jamie Riedesel 00:41:18 Yeah. Protocol buffers are supposed to be for static formats with a rigid schema. So you can’t really encode a variable array of fields onto something with a protocol buffer. But, for example, if you’re doing a metrics format that is three labels and a value, and you know that’s the format, you can actually use protocol buffers for that. I’m seeing more and more protobufs in there, perhaps as a carrier format for something inside; perhaps they need to integrate with a gRPC system, for example.
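To make the "three labels and a value" idea concrete, here is a small Python sketch of a rigid metric schema. It uses a dataclass carried as JSON purely for illustration and is not protobuf itself; the field names (`service`, `account_type`, `datacenter`) and the sample values are assumptions, not anything from the conversation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Metric:
    """Rigid schema: exactly three labels and one value, nothing else."""

    service: str
    account_type: str
    datacenter: str
    value: float


def encode_metric(metric: Metric) -> bytes:
    # Compact separators keep the payload small on the wire.
    return json.dumps(asdict(metric), separators=(",", ":")).encode("utf-8")


def decode_metric(raw: bytes) -> Metric:
    # Metric(**fields) raises TypeError for unknown or missing fields,
    # which is the rigidity a fixed schema gives you.
    return Metric(**json.loads(raw))


metric = Metric("signup", "team", "dc-east", 1.0)
raw = encode_metric(metric)
round_tripped = decode_metric(raw)
```

A free-form log event with a variable set of fields would not fit this shape, which is the trade-off being discussed: a static format is easy to parse and validate, but only when you know the fields up front.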
Gavin Henry 00:41:48 Yeah. Robert, the show’s editor, did the show on gRPC, and they talked about the benefits of protobuf because you can version some of the fields. So maybe that’s why people are reaching for that. And at this scale, so the web app using a software as a service centralized place, what size is that payload? What does it normally look like in volume?
Jamie Riedesel 00:42:13 It’s typically very small. Again, with the SaaS type providers it’s going to be a JSON document, and unless you’re encoding a ton of stuff, it’s going to be probably less than a couple of K.
Gavin Henry 00:42:23 And that’d be something they’re trying to push to keep it not too massive. Yeah. Yeah.
Jamie Riedesel 00:42:27 You keep it small to keep the job of the parser on the other side easy, just to increase your event throughput.
Gavin Henry 00:42:37 As a rule of thumb, how fast should this data ship? Is that, again, down to the SDK?
Jamie Riedesel 00:42:42 It’s a lot down to the SDK, but again, you do want to ship it as fast as possible. Most SDKs, language support permitting, do try to do an asynchronous send in the background just to avoid blocking the production threads. Not every language supports that though.
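The asynchronous background send mentioned above can be sketched roughly like this in Python. `BackgroundSender` is a hypothetical name and the `send` callable stands in for whatever a vendor SDK would actually do over the network; real SDKs differ in buffering and retry behavior.

```python
import queue
import threading
from typing import Callable

_STOP = object()  # sentinel that tells the worker thread to exit


class BackgroundSender:
    """Hands events to a worker thread so the caller never waits on the network."""

    def __init__(self, send: Callable[[dict], None], max_pending: int = 1000) -> None:
        self._send = send
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self.failed = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, event: dict) -> None:
        # Never block a production thread: if the buffer is full, drop and count.
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            if event is _STOP:
                return
            try:
                self._send(event)
            except Exception:
                # Telemetry failures must not take the application down.
                self.failed += 1

    def close(self) -> None:
        # Drain whatever is queued, then stop the worker.
        self._queue.put(_STOP)
        self._worker.join()


sent = []
sender = BackgroundSender(sent.append)
for n in range(3):
    sender.emit({"event": "page_view", "n": n})
sender.close()
```

Dropping events when the buffer is full is one possible design choice; it favors the production application over telemetry completeness, in line with the parasitic load concern raised earlier.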
Gavin Henry 00:42:57 I was reading into, not quite telemetry, but Google Analytics on some of the mobile apps: you have to specifically enable developer mode so you can get stuff in real time. But outside that, in production, it will batch it up and send it once an hour or something.
Jamie Riedesel 00:43:13 Yeah, mobile is the one case where you still have a lot of batch mode, because you can’t trust the network and you can’t have continuous real-time monitoring, because that’s bad for the battery on the device. So you have to build your stuff to be able to batch it out. And importantly, you need to be able to accept out-of-order arrival for telemetry coming from places, because you can get stuff that is hours old coming in right now.
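A rough Python sketch of that batching model, under the assumption that each event carries the device's own timestamp. `MobileBatcher` and `Ingest` are hypothetical names; the point is only that the client holds events until it has a network, and the receiving side orders by when things occurred rather than when they arrived.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Event:
    occurred_at: float                  # device clock, seconds since epoch
    name: str = field(compare=False)    # ignored when sorting


class MobileBatcher:
    """Client side: hold events locally and upload only when online."""

    def __init__(self, upload: Callable[[List[Event]], None], max_batch: int = 50) -> None:
        self._upload = upload
        self._max_batch = max_batch
        self._pending: List[Event] = []

    def record(self, event: Event) -> None:
        self._pending.append(event)

    def flush_if_online(self, online: bool) -> int:
        if not online or not self._pending:
            return 0
        batch = self._pending[: self._max_batch]
        self._upload(batch)             # if this raises, events stay queued
        del self._pending[: len(batch)]
        return len(batch)


class Ingest:
    """Server side: accept batches in any arrival order."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def accept(self, batch: List[Event]) -> None:
        self.events.extend(batch)

    def timeline(self) -> List[Event]:
        # Order by when the event happened, not by when it showed up.
        return sorted(self.events)


ingest = Ingest()
phone = MobileBatcher(ingest.accept)
phone.record(Event(100.0, "open_document"))
phone.record(Event(160.0, "sign_document"))
offline_result = phone.flush_if_online(False)    # no network yet
ingest.accept([Event(3600.0, "server_side_audit")])
online_result = phone.flush_if_online(True)      # hour-old events arrive now
names_in_order = [event.name for event in ingest.timeline()]
```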
Gavin Henry 00:43:33 What sort of data volumes are we talking about at this level?
Jamie Riedesel 00:43:37 The data volumes are very different, frankly. I think of it in events per second. For a small web application, you’re probably looking at less than a thousand events per second. I’ve personally managed systems up to 35,000 events a second, and I’ve heard of them getting over a million events a second for truly large systems.
Gavin Henry 00:44:06 Wow. So at this level, when you’re looking at the different SaaS price plans, I guess you have to do some sums and say, right, if the SDK says the packet is 1K, which is big for JSON, I suppose, how quickly does that ship? And then, you know, how much history do I want? Just multiply that by seven days or whatever it is. Is that a good way to think of it?
Jamie Riedesel 00:44:32 Yeah. SaaS provider plans are a whole lot of spreadsheeting work, I’m afraid. Yeah. Just knowing how your application
Gavin Henry 00:44:37 Yeah. And then keep a credit card on file with a big balance. Okay.
Jamie Riedesel 00:44:42 Unfortunately that seems to be how the small, very small space works.
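The spreadsheet sums described above boil down to a single multiplication. This Python sketch uses the figures mentioned in the conversation (about a thousand events per second, roughly 1K per event, seven days of history); the function name is hypothetical and real vendor pricing has more dimensions than raw bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retained_bytes(events_per_second: int, bytes_per_event: int, retention_days: int) -> int:
    """Bytes held at steady state: rate x size x retention window."""
    return events_per_second * bytes_per_event * retention_days * SECONDS_PER_DAY


# Small web application: 1,000 events/s, 1 KiB each, kept for 7 days.
small_app = retained_bytes(1_000, 1024, 7)

# The same payload size at the 35,000 events/s mentioned for a larger system.
larger_system = retained_bytes(35_000, 1024, 7)

print(f"small app:     {small_app / 1024**3:,.0f} GiB retained")
print(f"larger system: {larger_system / 1024**4:,.1f} TiB retained")
```

Swapping in a vendor's per-gigabyte ingestion price turns the same numbers into a monthly cost estimate.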
Gavin Henry 00:44:46 Yeah. And in the real world, so a start-up business, any type of business, what’s a good rule of thumb for how long this data should be kept? Or is that sector specific again?
Jamie Riedesel 00:44:57 A little bit sector specific, but generally what I have observed is that centralized logging data, which is mostly string based, which requires an expensive database and tends to be your biggest data store, is rarely preserved for more than about 30 to 60 days, just because it’s expensive. Metrics data can go back 10 years, frankly. It may be highly summarized by the time it gets to 10 years, but you can keep it that long; economically it’s way easier to do. Traces have kind of the centralized logging problem. There’s a high cardinality for that kind of data, which really puts downward pressure on your retention policy. So the software as a service providers will give you 30 or 60 days. And if you want more, you can negotiate it, but it’s going to cost money.
Gavin Henry 00:45:44 What makes this data store so expensive? The complicated string based data?
Jamie Riedesel 00:45:51 Yeah. It comes down to cardinality again. I’m going to use Elasticsearch for this because it’s famous for the ELK stack. What makes that expensive is how much analysis Elasticsearch does every time it ingests a new document. Say you have a logging statement that just said “account 895 created in team 42,” very natural language stuff. What Elasticsearch does with that English language string is run analyzers on it to figure out where the words are. It tokenizes them, so you can actually create the terms: account, number, created, in, team, number. And so you can, in your search interface, put in “created team” and get all of the account creation events, just because Elasticsearch is doing all the string processing for you. Now, to do that string processing, it has to do a lot of backend indexing and storage, quite frankly.
Jamie Riedesel 00:46:49 So it’s complex, and keeping all those terms around gets very large, and very large ends up being very expensive to keep around. That’s why you see that problem. Now contrast this with numbers: you wouldn’t see the “account created in team” whatever. There you’d see an account creation counter. So you increment a counter in a given microsecond, and you’d have a couple of tags saying what type of account it was and which data center. So you’ve got little tiny bits of data. You’ve got two tags and a one, and that’s really tiny. It increments the counter, it’s already in the database, and you’re all good. You know, really, really tiny. So it makes it very cheap to store.
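A toy Python illustration of why the string side costs more than the counter side. `TinyLogIndex` is a hypothetical, deliberately naive inverted index and does not reflect how Elasticsearch's analyzers or storage actually work; the log messages and tag values are made up for the example.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set


def tokenize(message: str) -> List[str]:
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", message.lower())


class TinyLogIndex:
    """Every distinct term in every message gets its own posting list."""

    def __init__(self) -> None:
        self.docs: List[str] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def ingest(self, message: str) -> None:
        doc_id = len(self.docs)
        self.docs.append(message)
        for term in tokenize(message):
            self.postings[term].add(doc_id)

    def search(self, *terms: str) -> List[str]:
        if not terms:
            return []
        matches = set.intersection(*[self.postings.get(t, set()) for t in terms])
        return [self.docs[i] for i in sorted(matches)]


index = TinyLogIndex()
index.ingest("Account 895 created in team 42")
index.ingest("Account 896 created in team 7")
index.ingest("Login failed for account 895")
created = index.search("created", "team")

# The metrics version of the same two creations: one series, two tags, a count.
counters: Counter = Counter()
counters[("account_created", "team", "dc-east")] += 1
counters[("account_created", "team", "dc-east")] += 1
```

Three short log lines already produce eleven indexed terms, and every new account or team number adds another, while the two creation events collapse into a single counter series.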
Gavin Henry 00:47:32 I guess that problem gets harder the bigger you get as a company. Very much so. Thank you. I’m going to move us on to the final section of the show. We’re going to discuss telemetry data in a resource-constrained personal environment, which is quite different and obviously has its own challenges. What should we care about in low power or poor bandwidth environments? So let’s take an application on your smartphone.
Jamie Riedesel 00:48:00 For those environments, you’re looking at a different approach to telemetry, simply because you don’t have the real-time assumption you would in a data center. You want to track as little as possible, because you’re working with mobile providers, and as Apple has proven, their threshold for what counts as private information and what’s not private information changes. And if you suddenly find your telemetry streams blocked behind a new privacy control, your engineering org is in a lot of trouble. So you want to be able to anonymize as much as possible and do as little as possible. The very minimum I’ve seen is crash reporting, because you need to know what the crashes are and what systems they’re on. So you can batch upload your crashes in the background. I think that’s the minimum. But for actual usage metrics, you’re looking at, again, trying to keep as little as you can get away with. You’re not trying to maximize the throughput. You’re trying to minimize it, because every time that thread is active, you’re burning battery. And every time that thread is active in the foreground, you’re burning data too, if you’re reporting. So you want to minimize it.
Gavin Henry 00:49:03 So could you use the tracing style of telemetry on this, where you could flick a feature toggle and get some proper data, or do you just have to not think about that at all?
Jamie Riedesel 00:49:13 Application tracing is its own thing, because tracing is also a fairly high volume technique when you look at things like a data center, because you’re getting a lot of discrete events out of that. Anytime you enter or exit a function, you’re potentially creating an event that you’ll have to report home. And again, that’s the kind of thing that can get costly in terms of battery and phoning home, and might get banned in the future by future privacy controls. So you have to take a different approach to it. You definitely have your developer builds do that. You can do that with stuff that gets sideloaded and tested on different things for when you’re qualifying your hardware. You can use that to try to figure out how things work. But again, in those resource-constrained environments, the overhead of having the tracer in place is that Heisenberg uncertainty principle problem: by observing the system, you change the system. So you mostly get an idea as to how it’s usually supposed to run, but that’s with the tracer turned off.
Gavin Henry 00:50:09 Yeah. You could skip a part of your program or something like that. I guess if you’re doing some customer support, you could get them to go to the settings and debug section and switch something on for you. Just ask them.
Jamie Riedesel 00:50:21 Well, I’ve seen that model used a few times.
Gavin Henry 00:50:23 Yeah. Because a lot of app support teams will ask you to switch it on and then send the log files through the email system on your phone, which is obviously a bit slower, but it’s still a form of telemetry, isn’t it?
Jamie Riedesel 00:50:36 It is. And it’s a kind of opt-in telemetry, because the user has to take active action, which is much easier to defend when it comes time for Google or Apple saying, “This is now private information. You can’t have it.”
Gavin Henry 00:50:46 Yeah. Um, my next question, I think you’ve touched on already in a mobile environment, on a modern smartphone. What is useful data from an app? I think you’ve already said the crash data.
Jamie Riedesel 00:51:00 Yeah. Crash data is the absolute minimum because that tells you your compatibility with different hardware models. This is especially important in the Android space. But yeah, crash data is the absolute minimum, and you typically have a different stream for crash report data than you do for, say, application telemetry, simply because of the different functions. You know, you’ve got a crash report; there are actual APIs in iOS, I believe, for some of that stuff that you can hook into. So you report it differently. But for application level data, again, it’s like, how little can you get away with?
Gavin Henry 00:51:32 And how different is this from collecting logs on the backend servers of the app?
Jamie Riedesel 00:51:37 It’s a very different thing, because when you’re collecting logs on the backend servers of the app, you can actually assume always-on networking and that it’s relatively high speed, which is a different architecture completely. So you can have a process like Elastic’s Filebeat sitting on your server that’s just watching each file, waiting for a new line to show up, which it will pick up, vacuum up, and zoom off to wherever it’s going next. Whereas with a smartphone, you may have a thread whose job it is to see if we’ve had any batched events that we need to send upstream. And then once the thread goes active in the foreground and it can talk to the network, it burps them off.
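The server-side half of that contrast, a process watching a file for new lines, can be sketched in a few lines of Python. `FileTailer` is a hypothetical name and this is not how Filebeat is implemented; in particular it ignores log rotation and restarts, which a real shipper has to handle.

```python
import os
import tempfile
from typing import List


class FileTailer:
    """Remembers a byte offset and returns only complete lines written since."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offset = 0

    def read_new_lines(self) -> List[str]:
        lines: List[str] = []
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            while True:
                line = handle.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a line still being written
                self.offset += len(line)
                lines.append(line[:-1].decode("utf-8"))
        return lines


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.log")
    with open(path, "wb") as log_file:
        log_file.write(b"started\nlistening\n")
    tailer = FileTailer(path)
    first = tailer.read_new_lines()

    with open(path, "ab") as log_file:
        log_file.write(b"request 1\nrequest 2 (still being wri")
    second = tailer.read_new_lines()   # the partial line is left for later

    with open(path, "ab") as log_file:
        log_file.write(b"tten)\n")
    third = tailer.read_new_lines()
```

A shipper built on this would call `read_new_lines()` in a loop and forward each line immediately, which is the always-on assumption that does not hold on a phone.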
Gavin Henry 00:52:14 In your experience, are these things that should be always on, or something a user has to enable, or something support can remotely activate? I guess that comes down to the app store, potentially.
Jamie Riedesel 00:52:26 It comes down to the app store policy, I’m afraid. I mean, a certain amount of telemetry or tracking is already done through the advertising model for both stores, but you’re walking a kind of thin line: are you looking like advertising, where you need to track every single thing a user does for revenue purposes? Or are you trying to back it off a little bit and stay below the bar that the app stores are looking for in these sorts of phone home applications? Your regular application function is probably one of the better ways, because when you click on something and it phones home, you can still get those backend traces for what happened and what the user session was. That’s still there.
Gavin Henry 00:53:10 Yeah. Cause I guess you’ve got to try and think about what data do I need to give a good experience. So it’s still on the app store versus do I need to know if the user is using that feature?
Jamie Riedesel 00:53:21 This is where feature flags come in. In a lot of ways, phoning home for feature flags is a way to do that. And it’s a bit like dealing with JavaScript applications in browsers, in that you only have some control over what’s going on.
Gavin Henry 00:53:36 And how can we add telemetry without decreasing the readability of code or affecting performance.
Jamie Riedesel 00:53:43 That’s really up to the SDKs in a lot of ways, especially if you’re using a third party. Now, if you’re building an in-house SDK, say you’re a medium or large size company that’s building your own internal system for these sorts of things, developer usability is something the SDK designers really need to pay attention to. The tracing systems that I looked at are pretty good about that: you have one line of code you need to insert to say, “I want to start tracing here,” and another line to say, “stop tracing.” At that point magic happens, and that makes it very easy for the developer.
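One way an in-house SDK might get close to that one-line experience in Python is a context manager. This is a sketch under assumptions: `trace`, `finished_spans`, and `sign_document` are hypothetical names, and a real tracing SDK would also propagate trace IDs and export spans somewhere rather than append them to a list.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

finished_spans: List[Dict[str, Any]] = []


@contextmanager
def trace(name: str, **tags: Any) -> Iterator[Dict[str, Any]]:
    """Entering the block starts the span; leaving it (even on error) stops it."""
    span: Dict[str, Any] = {"name": name, "tags": tags, "error": None}
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["error"] = type(exc).__name__
        raise
    finally:
        span["duration_s"] = time.perf_counter() - start
        finished_spans.append(span)


def sign_document(account_id: int) -> str:
    # The only telemetry-related line the developer writes:
    with trace("sign_document", account_id=account_id):
        return f"signed-for-{account_id}"


result = sign_document(42)
```

Here the "start" and "stop" lines collapse into the `with` statement, so the business logic inside the block stays readable.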
Gavin Henry 00:54:16 What should not be captured and shipped, would you say?
Jamie Riedesel 00:54:21 That would be what I consider toxic data: financial data, privacy data, health data, and security data. Quite frankly, you really don’t want to be passing around PII. European GDPR changed the world in a lot of ways for what counts as private data. Email address, the thing that was counted as a username across the entire web for the last 15 years, is now considered PII under GDPR. So if you’re logging the username of logins, you’re now logging PII, which means you need to treat that the same way you do PII in your regular application, which is usually more restrictive than a good telemetry system needs. So you need to be very careful to keep that sort of toxic data, the stuff that requires special handling and has penalties attached if you get it wrong, out of your telemetry systems.
Gavin Henry 00:55:08 And PII is personal identifiable information. Yep. Yeah. You need to know who the user is so you can do your job. So if you need to log the email as it’s the primary way to log in, then you just have to justify it, I think, in your GDPR policy.
Jamie Riedesel 00:55:25 You can, and a safer way is to disambiguate that from the email address into whatever it is that’s showing up in your database, which may be an integer of some kind. So instead of logging the email that was on the form, log the account ID in your database, which is disambiguated from things. And it’s probably what you’re going to be looking up in the database anyway, if you have to go look in there.
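A small Python sketch of logging the account ID instead of the email from the form. The `handle_login` function, the in-memory `accounts` lookup, and the example addresses are all hypothetical stand-ins for a real login path and database.

```python
import io
import json
import logging
from typing import Dict, Optional


def handle_login(form_email: str, accounts: Dict[str, int], log: logging.Logger) -> Optional[int]:
    """Resolve the email to an account ID and log only the ID."""
    account_id = accounts.get(form_email.strip().lower())
    log.info(json.dumps(
        {"event": "login", "account_id": account_id, "success": account_id is not None},
        sort_keys=True,
    ))
    return account_id


# Capture log output in memory so the example is self-contained.
buffer = io.StringIO()
log = logging.getLogger("telemetry.login")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(buffer))
log.propagate = False

accounts = {"jamie@example.com": 1001}   # hypothetical account table
known = handle_login("Jamie@Example.com", accounts, log)
unknown = handle_login("someone.else@example.com", accounts, log)
logged = buffer.getvalue()
```

Note that a failed login logs `null` rather than the address that was typed, so mistyped or probing email addresses never reach the telemetry stream either.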
Gavin Henry 00:55:46 Yeah. It’s something that you can still link to somebody, a customer, but if it gets out, it’s not going to do any harm. Yeah.
Jamie Riedesel 00:55:54 Unlike like an email address, which is generally identifiable now.
Gavin Henry 00:55:58 Yeah. Could we talk a little bit about software telemetry and how that’s done in the HelloSign mobile app? Is that something you’re familiar with or
Jamie Riedesel 00:56:08 Yeah, the mobile development I’m pretty far from, so honestly, I’m not sure how that’s done. I do know the front end for regular web application stuff.
Gavin Henry 00:56:17 Okay. Take us through, if you can, if you’re allowed, a little bit of that.
Jamie Riedesel 00:56:24 It’s no secret that I’ve got an ELK stack running internally, and that’s primarily what we’re using for our centralized logging. It has been going very well for a bunch of years now. We also have a metrics stack based on InfluxDB and Grafana, which is, again, working very well. We had some teething issues early on with understanding how statsd worked for metrics summarization, but I figured out the bugs in that particular problem and came up with a better way to handle it. So it’s been a good run, frankly. The telemetry systems we’ve got right now are kind of my baby, so it’s been fun to watch.
Gavin Henry 00:57:00 I like it. I think that’s near the end of the show now. One of my favorite questions: what future do you predict for telemetry? Is it still going to be called telemetry, or are we going to come up with another name?
Jamie Riedesel 00:57:13 I have a feeling that because the pillars of observability have a lot of mind share right now, we’re going to start seeing a lot more observability stuff. Building observable systems as a concept is something you’re starting to see in books and blog posts right now. I think that’s going to increase, especially as the OpenTelemetry project finally releases and you start seeing the Kubernetes side of the world going all in on these systems. You’re going to see more rebranding of software telemetry as more of an observability thing. But I think telemetry as a word is going to stay with us.
Gavin Henry 00:57:49 And obviously software telemetry is extremely important for software engineers and companies. But if there was one thing you’d like a software engineer to remember from our show, what would that be?
Jamie Riedesel 00:58:01 One thing to keep in mind is to pay attention to your data safety, the toxic data that I spoke about earlier. If your application is handling financial, privacy, or health information, your telemetry systems can possibly contain that. So you need to make sure that your telemetry systems are clean of that. And if they’re not clean of that, you need to make sure that you have ways to clean up the spills, because when you mishandle this information, your company faces a lot of liabilities. The state of privacy and health information is changing every year. So keep up with the changes, and don’t be afraid to rewrite your telemetry history if it turns out you’re logging things that you really shouldn’t have been.
Gavin Henry 00:58:42 I think I’ve seen some tools that you can point at your GitHub repositories or any Git source code, and they’ll look for those types of things. Do you know of any that will go and look through your logs and try to flag that up, you know, before you get into production?
Jamie Riedesel 00:58:57 I don’t know any off the top of my head. I do know of a few checkers that will check the log streams themselves for PII. Amazon Macie has one. Microsoft has a downloadable one that you can run on your own, and I believe Azure also has an endpoint where you can submit stuff and it’ll flag PII that it finds in there. Honestly, I’m not seeing a whole lot for just download-and-run, because the major systems tend to be built on machine learning models, which they’re not releasing for free. So there’s not a lot in the open source space that will give you these sorts of detectors. Hopefully the future will fix that.
Gavin Henry 00:59:33 Yeah. I think the one I’m thinking of looks for secrets in your environment and stuff.
Jamie Riedesel 00:59:39 We’re looking at things like passing password hashes in your logs. I mean, that’s the kind of thing that I’m looking for, that I’m kind of warning against.
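As a rough illustration of the kind of log-stream check discussed here, the following is a minimal sketch that masks bcrypt-style password hashes and email addresses in a log line. The `PATTERNS` table and the `redact` function are hypothetical names chosen for this example, and the two regular expressions are assumptions that cover only a narrow slice of sensitive data. This is not how Amazon Macie or the Microsoft tools work; those are described above as being built on machine learning models.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for illustration only; a real detector needs far more.
PATTERNS = {
    # bcrypt hashes: "$2a$", "$2b$" or "$2y$", a two-digit cost, then 53 chars.
    "bcrypt_hash": re.compile(r"\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}"),
    # A deliberately simple email matcher, not a full RFC 5322 parser.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def redact(line: str) -> Tuple[str, List[str]]:
    """Mask every pattern match in a log line.

    Returns the masked line and the names of the patterns that matched,
    so a caller can raise an alert as well as clean the data.
    """
    hits = []
    for name, pattern in PATTERNS.items():
        line, count = pattern.subn(f"[REDACTED:{name}]", line)
        if count:
            hits.append(name)
    return line, hits
```

A filter like this could run in the shipping stage, before log lines reach central storage, which is cheaper than cleaning up a spill afterward. Pattern matching of this kind will miss anything it has no rule for, so it complements rather than replaces reviewing what the application logs in the first place.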
Gavin Henry 00:59:46 Before we finish up, was there anything we missed that you would have liked me to ask or you’d like to mention?
Jamie Riedesel 00:59:52 There is one, and it’s a little bit promotional, but something that a lot of engineers don’t quite realize is that their telemetry systems are still considered a business record, and business records can be subpoenaed. So, for example, if your company got caught with its pants around its ankles for GDPR in a big enough way, the regulators can and will ask for your telemetry data. So it’s a good idea, when you’re thinking about making changes to your telemetry systems, to consider how you would respond if the regulator says, “We need you to keep all of that,” or, not the regulator, if your own legal department says, “We need to keep all of this data until we say otherwise, just in case someone decides to sue us about this particular thing.” Having a plan in place to respond to that will take a whole lot of the panic out of that moment.
Gavin Henry 01:00:45 Yeah, I suppose you could just say, “Right, we’ll keep it forever.” That’s the answer.
Jamie Riedesel 01:00:51 It’s all a cost in the end. And the thing is that these databases change formats, and Elasticsearch is famous for this: every time it makes a major version change, it only restores the previous version’s Elasticsearch data. So if you have backups going back seven years and a new version comes out, you have to reprocess six years of history to make sure it’s still restorable. And that’s a hard process.
Gavin Henry 01:01:16 That’s the age-old problem of what media to keep backups on, isn’t it? Archiving digital information.
Jamie Riedesel 01:01:23 Yeah. And the data format shifts too, and that’s where I see the biggest digital rot these days. Because these days you can just throw it in an S3 bucket. No one cares except for whoever’s paying the bill, but being able to restore it is another problem.
Gavin Henry 01:01:35 So where can people find out more? They can follow you on Twitter, but how else could they get in touch?
Jamie Riedesel 01:01:41 Twitter is probably the best way to get ahold of me these days. I mean, that’s where I keep my primary social media presence. I’m speaking here and there; getting the book out took a lot of my time away, as did this pandemic we’re in the middle of. So hopefully I’ll get back out on the speaking circuit a bit more, and you can catch me someplace. You’ll typically find me at the LISA conference, and I’m trying to get to an SRE conference and some of the DevOpsDays around the area that I’ve been noticing.
Gavin Henry 01:02:06 That’s the large infrastructure… What does LISA stand for again?
Jamie Riedesel 01:02:11 Large Infrastructure Systems Administration, although they sort of dropped the acronym a few years ago.
Gavin Henry 01:02:15 That’s right. Well, Jamie, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.
[End of Audio]
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)