Jeff Meyerson talks with Brian Brazil about monitoring with Prometheus, an open source tool for monitoring distributed applications. Brian is the founder of Robust Perception, a company offering Prometheus engineering and consulting. The high-level goal of Prometheus is to allow developers to focus on services rather than individual instances of a given service. Prometheus is based on the Borgmon monitoring tool, widely used at Google, where Brian previously worked. Jeff and Brian discuss the tradeoffs of choosing not to replicate monitoring data; in some situations, the monitoring system will lose data because of this decision. Other topics discussed include distributed consensus tools, integrations with Prometheus, and the broader topic of monitoring itself.
- Brian Brazil on Twitter, https://twitter.com/RobustPerceiver
- Prometheus, https://prometheus.io/
- Robust Perception, http://www.robustperception.io/
- Kubernetes, http://kubernetes.io/
- Robust Perception Blog, http://www.robustperception.io/tag/prometheus/
Transcript brought to you by innoQ
This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Jeff Meyerson: [00:00:35.12] Brian Brazil is the founder of Robust Perception, a company that helps scale and support users of the open source Prometheus monitoring tool. Brian has committed heavily to the Prometheus project. Before starting Robust Perception, Brian worked at Google. Brian, welcome to Software Engineering Radio!
Brian Brazil: [00:00:52.17] Hi, Jeff. Glad to be here!
Jeff Meyerson: [00:00:54.28] Let’s start off by talking about monitoring. What is monitoring?
Brian Brazil: [00:01:00.27] I’ve talked to a lot of people over the last year or so, and different people have very different ideas of what monitoring is. For example, when I say that I do monitoring, I’ve had people thinking that I’m watching your network traffic to see if you’re accessing Facebook at work, which is obviously not what we do. I see monitoring as four things. One is alerting you when there is a problem or [unintelligible 00:01:23.13]. The second is giving you the information you need to be able to debug that problem, and the third is doing long-term trending and making business decisions. If you know your cache hit rate is 90% and a change will bring it to 95%, then that means you can get rid of hardware or add new hardware.
[00:01:44.16] The fourth one is just general plumbing. For example, if the hard drive fails, at the end of the day you have to send a human in to fix that, because the robots aren’t good enough yet. Those are the four things I see with monitoring in the computer space.
Jeff Meyerson: [00:01:57.15] Give me an example of something that you monitored when you worked at Google.
Brian Brazil: [00:02:04.08] I was primarily working in ads at Google, so we were monitoring the system, like the number of queries coming in per second, the [unintelligible 00:02:11.23] latency, as well as more system details. We had tons and tons of stats.
Jeff Meyerson: [00:02:19.24] I want to use that as an example later on. We’re talking about Prometheus today – you’ve claimed that Prometheus is a next-generation monitoring system. Could you talk about how monitoring using Prometheus would differ from a monitoring tool of the previous generation?
Brian Brazil: [00:02:40.07] If you’re looking at the previous generation, you’re going to look at something like Nagios, or something else that’s derived from it, of which there are quite a few. Nagios works on the principle of per-machine checks. I run this shell script, it returns true or false, and based on that I send an alert to a human. This is fine when you have five machines that you’re carefully caring for, tending and feeding. But in the current generation, where people are talking about serverless, talking about cloud, talking about cloud native, you might have hundreds of machines which are tiny and which appear and disappear all the time. An alert on a single machine isn’t really that useful. It also might be the case that one machine is being a little slow. That doesn’t necessarily mean the service as a whole is having a problem, because you’re still well within SLA.
[00:03:32.10] The canonical example where this comes in is you’re running some form of web service, and you have an SLA for latency. Let’s say latency has to be under a second. It’s extremely difficult using just Nagios to alert you on the fact that your latency is over a second, because you can only look at each machine, and you have to get that data somehow, and you can’t look at the overall service view.
[00:03:58.01] What other people end up doing is alerting on CPU being high. That sometimes might indicate a problem, but it’s not going to spot a deadlock, for example, and it’s also going to give a lot of false positives if logrotate runs a bit long. Whereas with Prometheus, because you’re getting insight from inside all the applications, you can get that latency, you can aggregate it up correctly and say, “Hey, I’m currently doing 880 ms. Everything’s fine.” If one server is a little slow, as long as you’re under that second, everything’s good.
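In PromQL terms, the fleet-wide latency check Brian describes might look something like this (a sketch; the histogram metric name is hypothetical):

```promql
# 99th percentile latency across the whole service, not per machine
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

An alert rule on top of an expression like this fires only when the service-level quantile breaches the SLA, regardless of any single slow machine.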
[00:04:27.09] Also, because it’s meant for cloud environments, as machines are added or removed by autoscaling, Prometheus will automatically pick that up and take care of it. You don’t, as with Nagios, have to restart, update and configure it every time.
Jeff Meyerson: [00:04:41.10] Talk a little bit more about why the move to cloud architecture changes how we are doing monitoring.
Brian Brazil: [00:04:49.18] The cloud architecture and the microservices as well to some extent affect it, because previously you’d have a few bare metal machines. Because you have only a handful of them, you can get away with doing things by hand. As you move into more dynamic environments and cloud environments, machines aren’t really a thing anymore. What you have is some form of computing substrate, because you’ll say “Hey, give me two CPUs, a GB RAM and a bit of disk”, and it lands somewhere.
[00:05:18.24] There’s no longer this tight binding between the services and instances and the processes you’re running, and the actual machines. You have to care a lot more about where things are and be able to deal with that, because you no longer are talking about the MySQL machine. It is a machine, which is currently running one part of MySQL service.
Jeff Meyerson: [00:05:36.16] When we’re talking about monitoring, does monitoring also encompass the act of responding to events, or is it merely the processing of events and the creation of metrics about those events?
Brian Brazil: [00:05:52.03] That’s one of those things that depends on who you talk to. Personally, I would say that monitoring ends when it’s sitting on something like PagerDuty, and then you’re onto incident response; other people would say it’s part of the system. It probably boils down to your cultural philosophy. For example, if you have a NOC and that’s part of your control system for keeping things running, then arguably that’s part of monitoring, whereas if you take an approach where everything is more [unintelligible 00:06:19.00] putting a lot of intelligent thought behind things, it’s kind of a separate incident response thing. It probably just comes down to your personal opinion; I don’t think there’s a bright-line answer on that one.
Jeff Meyerson: [00:06:31.08] In the past, software companies would often end up with a variety of monitoring tools. Why did that happen?
Brian Brazil: [00:06:38.28] The thing is that there’s various tools that have been developed over time that just do one thing particularly well, or supported one integration particularly well. For example, if you’re going to be using ElasticSearch, it has its own thing; if you’re using Cassandra, you [unintelligible 00:06:53.22] and each of those are integrated and work well for just that tool. Then you end up with all of those over time, handling different things.
[00:07:03.27] One of the things about Prometheus is that we have all these integrations, which is great, but every other monitoring system now [unintelligible 00:07:11.15], whether they’re commercial or open source. That’s annoying, that we’re all spending our time reimplementing these integrations, rather than focusing on the more interesting problems.
[00:07:23.05] One of the things with Prometheus is that yes, we have all these exporters, and we also provide the APIs and the interfaces that you can reuse for other monitoring systems. You don’t have to duplicate as much stuff.
Jeff Meyerson: [00:07:35.26] Is duplication the main negative consequence of when a company has to run multiple monitoring tools?
Brian Brazil: [00:07:45.20] Yes, it’s one of them. Each of them tends to have a slightly different philosophy, slightly different data models, different alert thresholds, different levels of power. And as with any system, whether it’s a built system or a language, or even a text editor [unintelligible 00:07:59.27] for every single additional one of those you have in your ecosystem, that’s more cognitive load for all of your team. It’s not so much that it’s monitoring, it’s that it’s one more thing everyone has to learn the intricacies of, all the special cases and all that sort of thing.
[00:08:18.06] If you’re able to consolidate down, so that the size of the system people have to understand is smaller, that’s better for everyone.
Jeff Meyerson: [00:08:25.03] As we get to talking about Prometheus, Prometheus is based on a monitoring tool that was built inside of Google, called Borgmon. What was unique about Borgmon?
Brian Brazil: [00:08:36.04] I don’t know too much of the history of Borgmon, I just used it a lot. If you look at Prometheus, it’s that it’s meant for dynamic environments. It has labels (called tags or dimensions, depending on what system you’re looking at), and it has a powerful query language. It can ingest large amounts of data, process it and produce pretty high-quality data and information out of that. That’s what’s really unique: that you have the labels and you have the query language.
Jeff Meyerson: [00:09:09.07] Borgmon is the monitoring system that monitors Google’s Borg; what is Prometheus? Why did Prometheus get created?
Brian Brazil: [00:09:21.13] Prometheus was started off in SoundCloud by Julius and Matt because they had StatsD and it wasn’t scaling particularly well for them. They started doing a spike to develop a better system, and that’s where Prometheus started. I got involved a year and a half later, and here we are now, three and a half years in. It started off as a reaction to the existing monitoring systems not being good enough.
Jeff Meyerson: [00:09:49.06] Prometheus improves the visibility into the internal health of our services. What is Prometheus doing in contrast to other monitoring tools?
Brian Brazil: [00:10:01.21] Do you have a specific tool in mind? Because there are three or four broad categories.
Jeff Meyerson: [00:10:06.09] No, I didn’t have anything in mind. What would be the most salient contrast in your mind?
Brian Brazil: [00:10:12.07] The big one you see is between metrics and logs; the other is between metrics/logs and profiling, and then all of those versus black-box monitoring. I’ll start with logs, such as if you’re using the ELK Stack, Fluentd, or one of those solutions. Logs and metrics are looking at the same data, but looking at it in a different way. If you have a log line being produced per user request, for example, then you’ll have so many fields in that log entry, but practically speaking you’re going to be limited to maybe 50 or 100 in terms of bandwidth and disk resources. For every single request you get these 50 things, but you can’t really go to a thousand there, just because it would use too much disk bandwidth, too much network I/O.
[00:11:06.28] Metrics, by contrast, don’t know about every single event, because they’re compressed in time. Every 10 seconds, every 20 seconds, every minute, you take a snapshot, and because it’s not looking at every single request, because it’s been aggregated across the time domain, you can have 10,000 of those, no problem. But you’re losing that [unintelligible 00:11:24.25] request information.
[00:11:27.01] That’s where the big contrast is, and the way I would see it is that you would start off with your metric system (Prometheus) and drill down to figure out which subsystem is at fault. Follow the latency through your microservice architecture and say, “Okay, it’s this service.” You’re looking at the rest of the metrics, because you’ve got 10,000 of them, and you notice, “Okay, it looks like it’s the billing subsystem.” Now that I know it’s in the billing subsystem and I know which request paths hit it, I jump over to the logs, see who’s been [unintelligible 00:11:55.09] slow queries and figure out that, “Oh, it’s this particular user, it’s this request”, and go from there.
[00:12:03.14] There’s this interplay then that metrics kind of narrow down what’s going on, and then you can jump to logs to get more information and figure out, “Okay, we know which subsystem. Now, which exact requests are the problem?” That’s where I see the interplay there.
The other one then is profiling. Profiling, I would consider anything like GDB, or if you’re pulling up strace, or any of the actual profiling tools, anything DTrace-y.
[00:12:31.16] It’s similar to logs and metrics, except it’s doing both of them for a really short period of time, because it’s really expensive to do otherwise. If you’re sampling every microsecond or hundred microseconds and pulling in every single event, that is a tremendous amount of data, and you can’t turn that on all the time, because it’s too expensive. However, once you’ve narrowed down the problem, you can pull out your profiling tools, debug the problem and go, “Okay, I figured it out. I’m going to turn this off now, because I can’t leave it on all the time.”
Jeff Meyerson: [00:13:00.15] One of those things that you listed there is that you want to be monitoring services rather than machines. What do you mean by that statement?
Brian Brazil: [00:13:11.04] As I mentioned in the previous example, if I have a machine that’s running MySQL in the old model of monitoring machines, then I’ll have a check that MySQL is turned on, a check that the CPU is okay, the memory is okay. But if that’s part of a set of [unintelligible 00:13:28.21] from MySQL, I no longer care about an individual MySQL machine. I only care that there are enough MySQLs running to provide good service.
[00:13:39.02] Let’s assume that MySQL is being used extremely simply, I just might look and say, “What’s the CPU load? Is it okay? Is latency okay across the entire fleet?” and if two of them are down but all of the other metrics are fine and everyone is getting good service, good reliability, then there’s no reason to alert. I’m not thinking about each individual machine, but thinking about the overall service and the overall view that the end user is getting. It’s kind of like the pets versus cattle view.
Jeff Meyerson: [00:14:10.15] Let’s start to talk about Prometheus in practice. What’s a Prometheus client?
Brian Brazil: [00:14:16.14] There are the Prometheus client libraries. There are 11 languages, I think. In order to get the most benefit out of the metric system (just like a log system), you need to instrument your code. What we provide for these are the client libraries: Go, Python, Java, Ruby are the main ones. You go into your code and say, “Every time I pass this line of code, increment this thing by one.” Or, “Every time that I go into this function, time how long it takes and track that.” Then that’s all kept in the memory state, and every now and then Prometheus comes and takes a snapshot of that.
[00:14:55.28] At its core, it’s just the way you say, “Hey, this is the interesting information that I would like to expose”, and we take care of all the book-keeping, concurrency, efficiency and all that sort of thing for you.
Jeff Meyerson: [00:15:07.10] With the example of maybe the ads platform that you worked on, what kinds of calls would a Prometheus client need to implement? What methods would it need to implement? What would it need to be able to respond to, and what kinds of information would a Prometheus client need to produce?
Brian Brazil: [00:15:24.15] The client itself is very generic. We’ve got four types – the Counter, the Gauge, the Summary and the Histogram. A Gauge might be in-progress requests, because that goes up and down, like a gauge in your car. If that goes very high, you have a problem, or maybe you’ve got a queue – how big is the queue versus the limit of the queue?
Then you’ve got Counters – counting how many requests are coming in, how many requests are coming for different ad formats, like Flash versus text. Then the Summary and Histogram are for timing latency, or maybe the number of bytes per request, to track that over time and see what that distribution is.
[00:16:04.14] That’s at the library level, the sort of mapping you’d have for the types to what’s provided. But at the end of the day, it’s up to the user to build things on top of that, because the library doesn’t know anything about ads, for example.
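As a sketch, the four types map onto the Python client like this (the metric names here are made up for illustration):

```python
from prometheus_client import Counter, Gauge, Summary, Histogram, generate_latest

# Counter: only ever goes up, e.g. requests served per ad format
AD_REQUESTS = Counter('ad_requests_total', 'Ad requests served', ['format'])
AD_REQUESTS.labels(format='text').inc()

# Gauge: goes up and down, e.g. requests currently in flight
IN_PROGRESS = Gauge('inprogress_requests', 'Requests in progress')
IN_PROGRESS.inc()
IN_PROGRESS.dec()

# Summary and Histogram: track distributions, e.g. latency or bytes per request
LATENCY = Summary('request_latency_seconds', 'Request latency')
LATENCY.observe(0.042)

RESPONSE_BYTES = Histogram('response_bytes', 'Bytes per response')
RESPONSE_BYTES.observe(512)

# Dump the current state in the text exposition format
print(generate_latest().decode())
```

As Brian says, the library is generic: it only does the book-keeping, and it is up to the application to decide what is worth counting or timing.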
Jeff Meyerson: [00:16:18.29] What would be the process for setting up that client? What kinds of programming hooks do you have to write to set up those different metrics that you want to give off?
Brian Brazil: [00:16:31.01] If I take the example of the Python client, you’d install it via pip – pip install prometheus_client – and then you’d import it, as you would with anything in Python. You create your metric, so you’d go my_metric = Summary(...), giving it a name and a help string. Then let’s say you want to time a function – you just decorate it with @my_metric.time(), and that’s it. So there are really only two lines of code there: one to set up the metric, and one to use it.
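Concretely, the two lines Brian describes might look like this (the function and metric names are made up):

```python
import time
from prometheus_client import Summary, generate_latest

# Line one: create the metric (a name, plus a help string)
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent handling a request')

# Line two: use it, here as a decorator that times the function
@REQUEST_TIME.time()
def process_request():
    time.sleep(0.01)  # stand-in for real work

process_request()
print(generate_latest().decode())
```

Calling start_http_server(8000) from the same library would then expose these numbers on /metrics for the Prometheus server to scrape.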
Jeff Meyerson: [00:17:03.22] Great. What is a Prometheus server?
Brian Brazil: [00:17:06.14] The Prometheus server is the core part of Prometheus. It is what goes and talks to all the clients and all the exporters, pulls in all the data, stores it on a local disk (I prefer the SSD), runs rules on it for alerting, and sends out alerts. It’s also available for graphing requests or pulling out the data over HTTP, which is how almost all of the dashboarding solutions work.
Jeff Meyerson: [00:17:31.02] Do you need a Prometheus server for each of the different services that you want to monitor?
Brian Brazil: [00:17:36.27] You could have one Prometheus server, there’s nothing stopping you from doing that. It is actually surprisingly efficient. A single one can take 80,000 samples per second, but normally what happens is that for organizational reasons, each team would get their own Prometheus. But there’s nothing stopping you running everything inside one.
Jeff Meyerson: [00:17:55.02] What are those organizational reasons for having one Prometheus server per team?
Brian Brazil: [00:18:00.04] If two teams are fighting over which way it should be done and have different opinions, you can do that. Then there are different resources, arguing who’s going to do which… They might have their own Prometheus server. There’s also an aspect of isolation there. If one team happens to put on a metric that’s way too big and starts causing problems, it’s going to affect everyone else on that Prometheus server. Being able to say, “Look, each of you have your own” is one way of looking at that. Prometheus itself is also extremely easy to run. It’s a standard Unix daemon.
Jeff Meyerson: [00:18:33.14] You mentioned that different Prometheus clients might implement their metrics in different ways. Are there also subjective decisions to setting up the Prometheus server?
Brian Brazil: [00:18:44.28] Yes, there are a lot of subjective decisions. The main one comes up around service discovery and what we call relabeling. If you look at any two organizations and how they think about their machines, maybe one’s in Amazon, one’s in bare metal. The one in Amazon cares about availability zones and regions. The one in bare metal only thinks about data centers. So they do have two very different models of the world.
[00:19:11.09] The way that they’ll think about the targets and the key/value pairs they will associate with them is going to be different. The interesting thing is that it turns out not only is this vastly different between companies, to the extent that no tools are likely to have the same mental model, it’s also very commonly different between teams and even within the team in the company, just because different people have different ideas of production versus development versus canary; people are laying stuff out by customers, or maybe they’re laying it out by team. That sort of subjectivity comes in to how you lay things out.
[00:19:45.28] One of the advantages of Prometheus is that everyone running on a Prometheus server can do their own thing and model the world in a way that makes sense to them.
Jeff Meyerson: [00:19:52.26] How does a Prometheus server discover the clients that it wants to connect to?
Brian Brazil: [00:19:58.16] For service discovery there are a few options. We support EC2, Azure… Google Compute Engine is still in code review, so we can discover machines via those. Then you have Marathon and [unintelligible 00:20:14.14]. We can pull data from there, including all the metadata they can give us. Then you’ve got Nerve and serversets on top of ZooKeeper. There’s also Consul as well, for people who use it; those are more generic things, explicitly designed for service discovery. Those are all available. We can go out and fetch those, and then the user can model data however they like on top of that, using relabeling.
Jeff Meyerson: [00:20:41.02] Once the service discovery process has taken place, what is the communication pattern between the Prometheus server and the Prometheus client?
Brian Brazil: [00:20:52.18] The Prometheus server will send out an HTTP request (it might be HTTPS) and it will send a scrape, normally to the /metrics endpoint. The Prometheus client will then send back the metrics. Then it will just keep that HTTP connection open and keep on sending normal HTTP requests at a regular interval.
Jeff Meyerson: [00:21:12.23] What is a typical interval?
Brian Brazil: [00:21:14.16] It depends. The default is 15 seconds. We’ve got people going from sub-second up to several minutes.
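In the server's configuration, that interval is set globally and can be overridden per job; a minimal sketch (the job and target names here are hypothetical):

```yaml
# prometheus.yml
global:
  scrape_interval: 15s        # the default Brian mentions
scrape_configs:
  - job_name: 'my-app'
    scrape_interval: 10s      # per-job override
    static_configs:
      - targets: ['app-1:8000', 'app-2:8000']
```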
Jeff Meyerson: [00:21:22.11] What might be the determination for how long you would want to wait before pinging your Prometheus client?
Brian Brazil: [00:21:30.03] At the end of the day it boils down to resources. You can imagine, if you consider one second versus 60 seconds, that’s 60 times resources you’ll spend in terms of network CPU, disk and so on, and you have to balance those costs. At the end of the day, this is a real-time streaming processing system, and those have costs.
Jeff Meyerson: [00:21:50.25] So Prometheus gets set up, you’ve got a client running, but the client is usually an instance of a replicated service. Let’s disambiguate that – how does the communication between the server and all of the different instances of a given client work? Am I doing a Round Robin pinging of the different clients, or am I pinging them all at the same time?
Brian Brazil: [00:22:16.02] It’s completely independent. Service discovery produces a whole pool of targets. Let’s say we’re working on a ten-second interval. The identity of each target is hashed, so we spread the load around. Then it just scrapes based on that. One might be at one second, one might be at seven seconds, if you have two; they’re at that offset continuously.
[00:22:38.23] If you had many hosts, it would just happen more and more often, and those ten seconds [unintelligible 00:22:42.18]. But each is a goroutine, and they’re independent, just with a hash for consistency.
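The spreading Brian describes can be sketched in a few lines of Python (a simplification; the hashing details here are illustrative, not what the server literally does):

```python
import hashlib

def scrape_offset(target: str, interval_seconds: int = 10) -> float:
    """Hash a target's identity to a stable offset within the scrape interval."""
    digest = hashlib.sha256(target.encode()).hexdigest()
    # Millisecond resolution within the interval
    return int(digest, 16) % (interval_seconds * 1000) / 1000.0

# Each target lands at its own fixed point in the ten-second window,
# so scrapes are spread out rather than all firing at once.
for target in ['app-1:9090', 'app-2:9090', 'db-1:9104']:
    print(target, scrape_offset(target))
```

Because the offset depends only on the target's identity, it stays consistent across scrape cycles, which is the "hash for consistency" Brian mentions.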
Jeff Meyerson: [00:22:49.21] How does the data from the different instances get aggregated?
Brian Brazil: [00:22:55.15] By default, Prometheus will take in the data. There are two ways to do the aggregation. One is when you send your request from, say, Grafana for graphing; you have an aggregator in there like “sum” that will do the math on the fly.
The other option – because if you start touching a few thousand time series, that can start to get a little expensive – is Prometheus can run recording rules (other systems call them standing rules/queries) that it can just evaluate regularly and pre-aggregate that stuff; that’s cheaper to access, because you need to touch one time series, rather than thousands.
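A recording rule of the kind Brian mentions might look like this in a rule file (modern YAML rule syntax; the metric and rule names are hypothetical):

```yaml
groups:
  - name: aggregation
    rules:
      # Pre-aggregate thousands of per-instance series into one series per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards can then query the single pre-computed job:http_requests:rate5m series instead of summing thousands of raw series on every graph refresh.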
Jeff Meyerson: [00:23:34.24] Is the data being pushed from each client? It sounds like the central server is pulling data from the clients.
Brian Brazil: [00:23:41.00] Yes, Prometheus is a pull-based system; it’s pulling every ten seconds, say.
Jeff Meyerson: [00:23:45.24] Why is there such a debate around the pull-based versus push-based monitoring?
Brian Brazil: [00:23:50.10] I’m not really sure, but I have some ideas why. It is a bit of a religious issue, a bit of Vim versus Emacs. In my opinion, there’s not too much technical difference between them. I think pull is slightly better, but only very slightly. One of the things that may impact it is that Nagios is a pull system, which can be tricky to scale, and people are thinking, “Oh, Nagios is pull, and Nagios doesn’t scale, therefore pull doesn’t scale”, which is a logical fallacy. In reality, push and pull will both scale pretty well. That’s one thing that might come into it.
[00:24:28.15] Another one is a lot of SaaS vendors have to use push for network reasons, so people are in for that. But I really don’t understand a lot of the really vehement opinions that push or pull can’t scale, because it is possible to scale both, as long as you’re willing to spend the time.
Jeff Meyerson: [00:24:45.18] Can you talk a little bit about how the scalability of a push-based system contrasts with that of a pull-based system?
Brian Brazil: [00:24:52.17] Sure. For a push-based system, you are going to get a stream of samples that are normally going to hit some form of load balancer. You then need to have something which is going to take in those samples, shard them back out so it can hash them across a set of servers, and then process them from there. The load balancer there is normally where things get a little more challenging, because you have what might be a tremendous volume of data, all in one place.
[00:25:18.11] On a pull-based system you don’t need that load balancer because each particular Prometheus server (or other pull system) is already pre-sharded, shall we say. You might have one for the MySQL, and it just pulls in that data. So you don’t have a large edifice in having to manage all that data.
Jeff Meyerson: [00:25:36.26] Talking more about Prometheus, when we get data from a Prometheus client into the Prometheus server, how does that data look? What is the data format?
Brian Brazil: [00:25:48.01] There are two data formats. The one that most users would see, and which all the clients produce, is text. There are a few comments for metadata, and then there might be “mymetric 3”, or “mymetric 1.5”. There can also be labels in [unintelligible 00:26:06.11] inside there.
The other format, which is equivalent, is protocol buffers. At the moment only the Go client produces that. In fact, the server and client use [unintelligible 00:26:17.11] negotiation to figure out which one to use. The format that most people end up using is the simple text-based format, and you can produce it – assuming you don’t need the escaping – pretty easily.
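A scrape of the text format might look roughly like this (the metric name and labels are invented for illustration):

```
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="400"} 3
```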
Jeff Meyerson: [00:26:31.14] Monitoring generates quite a large volume of data. Where does the data get stored? Does it get stored directly on the Prometheus server, or does the server push it to some other database?
Brian Brazil: [00:26:42.25] The data ends up stored on the Prometheus server, but we recommend SSD for performance reasons. In fact, with the new [unintelligible 00:26:49.07] it’s only 1.3 bytes per sample in production use cases, and it sits there. But Prometheus isn’t itself intended as a long-term data store, because a really long-term data store implies distributed storage. That means a distributed system, and those are really hard to get right. In my opinion, you don’t want that sort of hairy distributed systems problem, which is as likely to lock up in an emergency as to help you, tightly coupled to a Prometheus server.
[00:27:23.11] We will have long-term storage of some form, and Prometheus will push the data out. Today it can push out to InfluxDB, Graphite and OpenTSDB, though that support is still experimental. Ultimately, we want to be able to plug in other systems that can store the data long-term, with Prometheus able to transparently request data back from them. Then we have decoupling, where Prometheus will have maybe a few weeks or months of data right there, so even if the long-term storage is having a little bit of a problem, you can still get your critical monitoring in an emergency.
Jeff Meyerson: [00:27:57.15] Is that to say that the data on the Prometheus server is not replicated, it’s somewhat vulnerable to being lost?
Brian Brazil: [00:28:08.13] Yes, that’s correct. We generally consider Prometheus data to be ephemeral. It’s a cache, more than anything else. But in terms of reliability, if you want high availability and not be reliant on a single machine, with Prometheus it’s really easy with the pull model. You just turn up a second Prometheus server that’s acting identically to the first one, and then they both have the data, and they’re both alerting.
Jeff Meyerson: [00:28:36.24] Right, but I guess that replication model would still potentially be vulnerable to diffs between the two. But since it’s monitoring, you’re not too concerned if there’s a slight difference in one instance of a metric.
Brian Brazil: [00:28:54.27] Yes. There are so many race conditions in monitoring that this one doesn’t actually matter. If you can imagine the scrape… Or, you even have the exact same thing in a push-based system – if the network’s slightly slower or faster, or the scheduler in the kernel is slightly off, you’ll get slightly different results. Most of the races are of that magnitude, and you’d be looking at something similar for this. As well, if you think about it, going back to the example of the 10-second scrape interval, you don’t know where it is in that 10 seconds. If you get different answers, and exactly where you land in that 10 seconds is significant to you in terms of alerting, then your monitoring isn’t going to be very resilient or robust. You need to be resilient to that anyway, because that’s just the nature of monitoring.
Jeff Meyerson: [00:29:44.09] One thing I’m hearing here is that the monitoring data doesn’t need to be as durable as business data. This seems like a pretty fundamental difference between how we look at business data versus how we look at monitoring data. Are there any other ways in which that implication affects the architecture of Prometheus?
Brian Brazil: [00:30:07.15] It’s not so much durability as the engineering tradeoffs. With Prometheus, we value availability over consistency. At the end of the day, if your network is falling apart, you don’t want to depend on something like Zookeeper that’s also going to fall apart.
In terms of other places for reliability – for example, between Prometheus and the alert manager. The alert manager takes in all the alerts – because all of the alerting logic is inside Prometheus, it generates the alerts and sends them to the [unintelligible 00:30:36.08] alert manager. The alert manager then talks to e-mail, talks to PagerDuty, talks to Slack, talks to HipChat, and can talk out to JSON over HTTP if you want it to.
[00:30:46.06] Our approach for making that communication reliable is that the Prometheus server just continuously repeats those messages. Similarly, on the alert manager side, there is a repeat interval plus a few retries. Humans sometimes drop pages and ignore them by accident, or are sleeping in or whatnot, so there’s a repeat interval to resend that alert after an hour if nothing has happened. So it’s just multiple retries all throughout the system, to make things nice and reliable.
[00:31:18.09] In practice, because of crashes or whatnot, you might be missing an alert for a minute, but you’ll get it later and that’s okay, because it takes a human five minutes to [unintelligible 00:31:28.21] laptop anyway.
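As a rough sketch of what that alerting logic inside Prometheus looks like, an alerting rule might resemble the following (the metric names, thresholds, and labels here are illustrative, not from the conversation):

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Fires when the 5-minute error ratio stays above 1% for 10 minutes.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 1% for 10 minutes"
```

Prometheus evaluates the rule on every evaluation interval and keeps re-sending any firing alerts to the alert manager, which is the retry behavior described above.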
Jeff Meyerson: [00:31:32.00] A system like Zookeeper or Consul – they’re often useful for coordinating a distributed system by providing consensus. What are the downsides of requiring consensus coordination and why can you avoid consensus when using Prometheus?
Brian Brazil: [00:32:00.04] Consensus is great when you need the consistency. Billing data is probably the best example, where you need everything to be perfect. In monitoring with Prometheus, we’ve made the engineering choice not to use consensus – or at least not to use strongly consistent consensus.
If you have a quorum, for example — let’s say, your standard example, you need three nodes for quorum, and you have a network partition that knocks out one of them, and a machine failure knocks out the other – suddenly, you can’t progress. Depending exactly on how you are set up, you could just then be dead in the water and the monitoring system no longer works. Prometheus sidesteps all those questions by having each server completely isolated and doing its own thing.
Jeff Meyerson: [00:32:46.00] If we were doing billing systems instead of monitoring, would the point of consensus be around the information that is being stored? I guess you would need multiple records, whereas in this case you just have a single Prometheus server that you’re looking at as the monitoring source of truth, and you don’t have strong guarantees about the consistency of that data.
Brian Brazil: [00:33:23.09] Well, you can have reliability if you’re running two Prometheus servers. They might have slightly different data, but they will provide you with reliable alerting, because there’s two of them.
Jeff Meyerson: [00:33:35.03] Let’s talk a little bit about the querying model. What is the API for querying the Prometheus server?
Brian Brazil: [00:33:45.21] The primary one is the HTTP API. You just send it a query, and it will send you back the results. That’s how all the graphing solutions work. You just say, “At this time, execute this query”, or “Between these time intervals, every ten seconds, give me the results.”
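As a minimal sketch of that HTTP API: Prometheus really does expose `/api/v1/query` for instant queries and `/api/v1/query_range` for ranges, but the host, metric, and sample response below are made up for illustration. The code only builds the request URLs and parses a canned response, so it runs without a live server:

```python
import json
from urllib.parse import urlencode

# Build an instant-query URL: "at this time, execute this query".
def instant_query_url(base, promql, when=None):
    params = {"query": promql}
    if when is not None:
        params["time"] = when
    return f"{base}/api/v1/query?{urlencode(params)}"

# Build a range-query URL: "between these times, at this step, give me results".
def range_query_url(base, promql, start, end, step="10s"):
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"

# A response has the shape {"status": ..., "data": {"result": [...]}};
# each result carries the series' labels and its sampled value(s).
sample = json.loads(
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{"job":"node"},"value":[1466000000,"1"]}]}}'
)
for series in sample["data"]["result"]:
    print(series["metric"], series["value"][1])
```

In a real setup you would send those URLs with any HTTP client; graphing tools like Grafana do exactly this under the hood.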
Jeff Meyerson: [00:34:08.24] What would be an example query that we might make against our monitoring data?
Brian Brazil: [00:34:14.14] A simple one is you have a request counter – let’s just say it’s called request_total, which is a little ambiguous. The first thing you’d want to do is convert that into a rate, a per-second value, because a counter just continually going up – up and to the right is what you want in your graphs, but it’s not that useful on its own. So you would go rate(request_total[1m]). That will tell you the per-second rate, and you could graph that.
[00:34:45.16] But that’s going to be per-instance, and we care about service-level stats. We want to get rid of that distinguishing instance label, so on top of that we could do a sum without the instance label of the rate of request_total over five minutes, and then we’d have the request rate for that metric across everything the Prometheus server monitors.
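Written out as PromQL, the two queries dictated above would be (assuming the counter really is named request_total):

```promql
# Per-instance per-second request rate over the last minute:
rate(request_total[1m])

# Aggregated across instances: the service-level request rate.
sum without(instance) (rate(request_total[5m]))
```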
Jeff Meyerson: [00:35:08.06] This is a more general monitoring question – what are the different situations where an engineer is querying the monitoring server in the day-to-day workflow? Am I only querying it to set up dashboards? Am I querying it to figure out a situation during a fire? What are the different reasons why I might query my monitoring server?
Brian Brazil: [00:35:34.10] Normally, you’d use it via preexisting dashboards, because the key statistics like requests/second, latency, CPU usage, memory – they’re going to be in the dashboard already for you. You might have one dashboard per service and a few for the major subsystems of that service. But as you drill down into a problem, it’s possible that once you’re far enough in there is no dashboard for what you want, and then you would just directly start to write queries.
[00:36:04.24] Anything like that where you’re exploring the metrics – maybe you’ve got the code in one hand and you’re like, “This metric here, if it’s being incremented, it means we’re going down this code path and it will help me debug. There’s no dashboard for it, let me write a quick query.” There’s an interplay there – “Oh right, it went down that code path” – so you’re bouncing between code and ad-hoc queries.
Jeff Meyerson: [00:36:26.03] Right. Maybe an example of an ad-hoc query in the advertising case: you don’t have a dashboard built to look at the metrics of an individual website that’s making requests for ads, but if you start to have some kind of problem associated with that website, you might want to write an ad-hoc query against all the instances of that website requesting an ad.
Brian Brazil: [00:36:52.03] Yes, and because it’s an interplay as well, you might even have discovered this website because it’s mentioned in an error message in a log. All these different tools are complementary, between your logs, your metrics, your source code, and you need to look at all of them when you’re looking at an outage.
Jeff Meyerson: [00:37:09.28] Am I using Prometheus as my logging, or do I notice a problematic metric with my Prometheus metrics, and then I might go into the logs on the associated server?
Brian Brazil: [00:37:26.11] Normally, Prometheus would spot that. Prometheus is good for spotting mostly systemic issues. You notice that, “Hey, this error ratio is now at 1% or 2%.” Then you go further and notice, “Okay, I’ve looked at all the servers. This one server has 20%, the rest have none.” Now it’s time to look at that server and see what’s wrong. You can drill down like that, and then you look at the server, maybe you look at the logs.
[00:37:51.09] It’s possible to go from the other side too, but Prometheus is good for systemic issues. If you want to catch a one-off thing, logs are a better option.
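That drill-down could be sketched as a pair of PromQL queries – first the service-wide ratio, then the same ratio broken out per instance to find the outlier (the metric names here are illustrative, not from the conversation):

```promql
# Service-wide error ratio, e.g. "this is now at 1% or 2%":
sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))

# The same ratio per instance, to spot "this one server has 20%":
sum by(instance) (rate(errors_total[5m]))
  / sum by(instance) (rate(requests_total[5m]))
```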
Jeff Meyerson: [00:37:59.24] What happens with a network partition? When there’s a network partition between the Prometheus server and the clients that it’s supposed to be pulling from, and this partition lasts for an extended period of time – does that data just get lost or is there some way to retrieve it?
Brian Brazil: [00:38:19.16] In that case, the data would just be lost. Prometheus will attempt to scrape, it will time out, it will report the up variable as zero, and that’s it. The question then is – particularly when you have an HA setup and one is on one side of the partition, one is on the other, could they resync and merge back in? And the problem with that is first a question of semantics. That’s basically a Byzantine scenario, and you don’t know which is the [unintelligible 00:38:45.14].
The second problem is you’ve just had an outage, or you’re in the middle of ongoing network flakiness, and you’re chugging along at your normal pace. If you’re going to start backfilling data, you might be doubling or tripling your load, which could make the outage worse.
[00:39:01.00] So you’ve got those two problems there: it’s not clear what the right semantics are, and it could actually make an outage worse. It’s the sort of situation where by and large it’s not worth risking the reliability of the system – in terms of that extra load during an outage, and in terms of the code and all the risks there. Better to just say, “You know, we’re going to have a blip once a quarter when someone brushes against a rack switch, and that’s okay”, make that engineering tradeoff, and realize that simpler is better.
Jeff Meyerson: [00:39:32.04] Are there certain high sensitivity services where that’s not an acceptable strategy?
Brian Brazil: [00:39:38.13] There probably are, but in that case you’re looking for a different sort of monitoring. At that point, you’re looking for consistency over everything else, and in that sort of situation you would turn off the service if the monitoring broke, just because it would be bad for you to continue accepting queries, because you can’t reliably persist them.
Jeff Meyerson: [00:39:56.16] If I had an architecture that had some high sensitivity service where I need really consistent monitoring, how would I fit in that heterogeneity with a platform that is mostly fine with using Prometheus? How is Prometheus compatible with that kind of monitoring heterogeneity?
Brian Brazil: [00:40:21.04] It probably depends on the exact circumstances you’re looking at. With Prometheus, if the choice is between consistency and availability, we’ll always choose availability, because we’re okay with a small blip in data. If you’re not okay with a small blip in data, we can monitor the system that’s doing that and tell you how it’s generally performing, with all the metrics. But for that exact precision, you’ll need some other solution.
[00:40:46.14] For example, human health data – it would be a bad idea to put that in Prometheus. However, there are companies monitoring the systems that handle that data, and that’s okay.
Jeff Meyerson: [00:40:57.09] Can Prometheus monitor its own health?
Brian Brazil: [00:41:02.18] Yes, it’s pretty normal to have Prometheus scraping itself.
Jeff Meyerson: [00:41:06.11] What does the architecture look like for that? Do you have a separate server that is set up to be the Prometheus server’s server?
Brian Brazil: [00:41:15.05] Yes, that would generally be a good idea. It’s normal also to have each Prometheus scraping itself.
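A minimal sketch of that self-scraping setup – Prometheus serves its own metrics on `/metrics`, so the scrape config just points at itself; the second target is a hypothetical peer Prometheus watching this one:

```yaml
# prometheus.yml (sketch; hostnames are illustrative)
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090               # this Prometheus scraping itself
          - meta-prom.example.com:9090   # a separate server monitoring it
```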
Jeff Meyerson: [00:41:20.10] Interesting. What’s the onboarding process for a company that wants to use Prometheus?
Brian Brazil: [00:41:27.26] Normally, it’s best to start out small and low-risk, so you take the node exporter – which, despite the name, has nothing to do with Node.js; it’s our machine agent – install that on some machines, and install Prometheus to monitor those. Then you’ll get 500-600 stats from the Linux kernel: your file systems, your CPU, your network, your disks… You can look at that, see if you like it, and from there expand maybe to more infrastructure stuff like Cassandra, MySQL, Postgres, HAProxy. Then, once you’re really comfortable, you can start getting the real value, which is when you start doing direct instrumentation, using those client libraries in your code.
[00:42:05.08] Obviously, going straight to client libraries is a bit of a big ask, because it’s saying “Hey, I want my developers to write a whole pile of code on this system we’re trying out”, which is a bit of a hard sell. But if you can say, “Hey, with the node exporter you can be up and running inside 20 minutes and it’s easy to turn off again” – not a big risk, one person can do it. That’s a much easier sell.
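As a sketch of what that direct instrumentation looks like with the official Python client library (`pip install prometheus_client`; the metric name and paths here are illustrative):

```python
# Direct instrumentation: one Counter, incremented on each code path.
from prometheus_client import Counter, REGISTRY, generate_latest

REQUESTS = Counter("app_requests_total", "Total requests handled.",
                   ["path"])

def handle(path):
    # In a real service this would sit inside the request handler.
    REQUESTS.labels(path=path).inc()

handle("/")
handle("/")
handle("/about")

# generate_latest renders the text exposition format that Prometheus
# scrapes; in a real service you'd expose it via start_http_server(port).
exposition = generate_latest(REGISTRY).decode()
print(exposition)
```

Printing the exposition shows lines like `app_requests_total{path="/"} 2.0`, which is exactly what the Prometheus server would collect on each scrape.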
Jeff Meyerson: [00:42:25.13] What are the kinds of services that might be more difficult to monitor with Prometheus, or might take more of a complex architecture than we have defined so far?
Brian Brazil: [00:42:37.06] Ones that can be tricky are those with lots and lots of short-lived processes. Prometheus generally presumes that things live for a while – let’s say tens of scrape intervals, whatever that is. If things are shorter than that, you’re going to miss things. There are some things out there – Spark is one of them – that have lots of very short-lived processes. Even a pure serverless architecture would be tricky, although that presents some other challenges too.
[00:43:04.25] Other than that, anything where you are trying to do what really looks more like logging than metrics is going to be challenging, just because of the cardinality of the data. Prometheus works on metrics, and if you’re creating a new time series every time you run a job, eventually that’s going to run into a performance limit.
Jeff Meyerson: [00:43:25.22] What kinds of tools do people use in conjunction with Prometheus?
Brian Brazil: [00:43:29.04] The most common one is Grafana, which is what we recommend for graphing. It’s probably the biggest one. I’m not sure if there are any other tools that are particularly consistent across the user base, because everyone’s so varied.
Jeff Meyerson: [00:43:44.16] Your company, Robust Perception, offers Prometheus consulting. What are the typical situations that your clients are in when they come to you?
Brian Brazil: [00:43:54.24] We’re dealing with both big and small companies. A smaller company might not have much monitoring at all and just want a hand getting things up and running. The larger companies tend to have an existing monitoring system, they tend to have an existing monitoring team, and they aren’t scaling too well. It’s a lot of operational effort and not a lot of value; things aren’t aligned, and it’s like, “How do we best use Prometheus? How do we deploy it? How do we make sure it scales? Are there any pitfalls that we can avoid?” That sort of thing.
Jeff Meyerson: [00:44:29.08] And these companies that already have some monitoring set up, how do they integrate Prometheus into the rest of their monitoring infrastructure?
Brian Brazil: [00:44:38.26] Typically it starts off with Prometheus, like any new monitoring system coming in, being a more advisory thing there on the side, and over time it starts taking on an alerting role as well. You then end up with two monitoring systems for a while, until you can cut over. It’s a question of what your processes are and how you deal with alerts coming from a new system, and dashboards in a new system.
[00:45:03.19] Most of the companies are already using Grafana, so it’s easy for them to use new dashboards. The challenge normally comes more around alerting and what that process is. For example, if you’re used to using something like Uchiwa with Sensu [unintelligible 00:45:15.10], and then Prometheus has a different one, you need to think about how you integrate those together. Do you write a consolidator, do you just have two of them to look at, or maybe you have a completely different operational model?
Jeff Meyerson: [00:45:30.22] What aspects of an organization might change once Prometheus is running?
Brian Brazil: [00:45:37.20] Because you can now alert on service-level metrics that are related to the SLAs you have agreed with your customers, you can turn off a lot of alerts that are waking people up in the middle of the night but aren’t directly related to that. That means you don’t need humans in rooms watching monitors, ready to react to an emergency; you can scale that back and reinvest that effort into general engineering.
Jeff Meyerson: [00:46:02.16] Interesting. How does the workflow of an on-call person or maybe an SRE change with Prometheus?
Brian Brazil: [00:46:12.23] I don’t think the workflow would change too much. It might have links to more useful dashboards, because there’s a richer set of metrics, depending on what system you’re looking at. The process is mostly the same, because once you’re on call, it’s incident management.
Jeff Meyerson: [00:46:27.23] Do you know to what degree Prometheus is a rewrite of Borgmon, and in what ways it differs from Borgmon?
Brian Brazil: [00:46:39.07] There aren’t too many public details around Borgmon. There’s a bit in the SRE book and some [unintelligible 00:46:44.07] so I’m not going to speculate too much.
Jeff Meyerson: [00:46:47.24] Okay, fair enough. What’s in the future for the Prometheus project?
Brian Brazil: [00:46:50.29] The main things we’re looking at at the moment: there’s the long-term storage [unintelligible 00:46:54.28]; we want to get that in there, because it’s a big user request. We’re also looking at making the alert manager highly available, because right now it’s technically a single point of failure. We’re looking at an eventually consistent approach for that, so that it will continue to work in a network partition. Those are the two big things. Then there’s continuous improvement of client libraries, more exporters, more features in Prometheus and so on.
Jeff Meyerson: [00:47:21.21] Can you talk a little bit more about the alerting system? We didn’t discuss alerts very much, and alerting is closely related to monitoring. If you could talk about alerting and what the unique requirements of an alerting system around Prometheus are.
Brian Brazil: [00:47:39.13] In Prometheus, labels are one of the key aspects, and the labels propagate all the way through. You can route alerts based on labels. Normally, the idea would be that you would have an alert manager per company, or a cluster of alert managers per company, with one shared configuration. That means that different Prometheus servers can send alerts to the other teams. If I’m the infrastructure Cassandra service, I can say, “Hey, you’re running out of quota”, or whatever like that.
[00:48:13.04] You basically have a tree that’s [unintelligible 00:48:12.20] and different people can say, “Okay, this is the MySQL service; they like to go to Slack and PagerDuty. This over here is the Cassandra service; they’re using HipChat and OpsGenie”, and you can route like that.
The other big thing is that we have grouping. A single alert comes in, and it results in one notification to PagerDuty. But for something like a “machine down” alert – because you still need someone to fix that – you can have it come through as a single notification even if a hundred machines die. That’s a massively reduced number of notifications; you can imagine your pager isn’t jumping around on the table for five minutes as it processes those. Instead, you get a single notification saying, “Hey, there’s a hundred machines down”, which means you’re less likely to be overloaded in an emergency and have a better idea of what’s going on.
[00:49:06.00] The other advantage is you can do this however you like. You probably want to aggregate machines per data center, but let’s say you had some system that’s replicating data globally, and you had a “the data is stale” alert. You can aggregate that by alert name, because the chances are, if the upstream data provider is having a problem, all the data centers are going to alert at the same time. So you can consolidate that down to one notification, getting closer to the idea of one notification per root cause, which is nice.
[00:49:34.29] The other thing the alert manager has that you need is silences. In advance of a maintenance, you can put in some labels and say, “Hey, if it matches these, don’t alert for the next two hours.” That’s really useful as you expand out.
One of the more interesting aspects of the alert manager is performance. When you have labels and metrics, it’s easy to create many alerts, so you need to make sure it’s highly performant. You could easily have tens of thousands of alerts active at a single time.
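The routing and grouping described above could be sketched as an alert manager configuration like this (the receiver and label names are illustrative, not from the conversation):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: default-pager
  # Group alerts sharing these labels, so a hundred "machine down"
  # alerts in one data center become a single notification.
  group_by: ['alertname', 'datacenter']
  routes:
    - match:
        service: mysql
      receiver: mysql-team      # e.g. Slack + PagerDuty
    - match:
        service: cassandra
      receiver: cassandra-team  # e.g. HipChat + OpsGenie
receivers:
  - name: default-pager
  - name: mysql-team
  - name: cassandra-team
```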
Jeff Meyerson: [00:50:06.05] Brian, I think this is a reasonable place to stop. Where can people find out more about you and your work?
Brian Brazil: [00:50:15.06] The main place to look would be the Robust Perception website, in particular the blog there. I usually write a post on Prometheus once a week. And for Prometheus generally, the Prometheus site itself has everything.
Jeff Meyerson: [00:50:26.24] What kinds of things have you written about recently?
Brian Brazil: [00:50:29.10] I’ve looked a bit at why we don’t have one agent per machine, but rather an exporter per team. I’ve looked at monitoring for consensus, I’ve looked into how counters work… I’ve written so many posts at this point, it’s hard to keep track of them all.
Jeff Meyerson: [00:50:49.26] That’s great, cool. That’s a good reason for people to go check out your blog. Brian, I want to thank you for coming on the show and talking about Prometheus in detail, I really appreciate it.
Brian Brazil: [00:50:59.29] Thank you.