Non-dimensional versus dimensional metrics
Before we explore the tools we'll choose from, we should discuss different approaches to storing and collecting metrics.
We can divide the tools by their support for dimensions: some can store data with dimensions while others cannot. Graphite and Nagios are representatives of the dimensionless group. Truth be told, there is a semblance of dimensions in Graphite, but they are so limited that we'll treat it as dimensionless. Some of the solutions that do support dimensions are, for example, InfluxDB and Prometheus. The former supports them in the form of key/value tags, while the latter uses labels.
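The difference is easiest to see in the way each tool encodes a measurement. A minimal sketch (the service names and values below are made up; the three formats themselves are real):

```python
# The same measurement encoded without and with dimensions.

# Graphite plaintext protocol (dimensionless): anything resembling a
# dimension must be flattened into a fixed position in a dotted path.
graphite_metric = "prod.my-service.2.memory_usage 18250000 1465839830"

# InfluxDB line protocol: dimensions are key/value tags appended to the
# measurement name, followed by the field value.
influxdb_point = "memory_usage,service=my-service,replica=2 value=18250000"

# Prometheus text exposition format: dimensions are labels in curly braces.
prometheus_sample = 'memory_usage{service="my-service",replica="2"} 18250000'
```

Note that in the Graphite form the meaning of each path segment is implicit, while the tag and label forms carry the dimension names with the data.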
Non-dimensional (or dimensionless) metric storage belongs to the old world, when servers were relatively static and the number of monitored targets was relatively small. That can be seen from the time those tools were created: both Nagios and Graphite are older than InfluxDB and Prometheus.
Why are dimensions relevant? A query language needs them to be effective. Without dimensions, the language is bound to be limited in its capabilities. That does not mean that we always need dimensions. For simple monitoring, they might be an unnecessary overhead. However, running a scalable cluster where services are continuously deployed, scaled, updated, and moved around is far from simple. We need metrics that can represent all the dimensions of our cluster and the services running on top of it. A dynamic system requires dynamic analytics, and that is accomplished with metrics that include dimensions.
An example of a dimensionless metric would be container_memory_usage. Compare that with container_memory_usage{service_name="my-service", task_name="my-service.2.###", memory_limit="20000000", ...}. The latter example provides much more freedom. We can calculate average memory usage as we would with the dimensionless metric, but we can also deduce the memory limit, the name of the service, which replica (task) it is, and so on, and so forth.
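To make that freedom concrete, here is a minimal sketch of the kind of slicing a dimensional query language allows. The label names mirror the example above; the samples, values, and the `avg_usage` helper are made up for illustration:

```python
# Labeled samples, roughly as a dimensional store would keep them:
# (labels, value) pairs. Values are invented for the sketch.
samples = [
    ({"service_name": "my-service", "task_name": "my-service.1"}, 12_000_000),
    ({"service_name": "my-service", "task_name": "my-service.2"}, 18_000_000),
    ({"service_name": "other", "task_name": "other.1"}, 4_000_000),
]

def avg_usage(samples, **labels):
    """Average value over samples whose labels match all given pairs."""
    matched = [value for sample_labels, value in samples
               if all(sample_labels.get(k) == v for k, v in labels.items())]
    return sum(matched) / len(matched)

# Average across everything (all a dimensionless metric can offer)...
overall = avg_usage(samples)
# ...and average per service, which requires dimensions.
per_service = avg_usage(samples, service_name="my-service")  # -> 15000000.0
```

With a dimensionless metric, only the first query is possible; every other breakdown would require baking the dimension into the metric name up front.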
Are dimensions (or the lack of them) the only thing that distinguishes tools for storing and analyzing metrics? No. Among other differences, the way those metrics end up in the database can be significant. Some of the tools expect data to be pushed to them, while others pull (or scrape) it.
If we stick with the tools we mentioned previously, representatives of the push method would be Graphite and InfluxDB, while Nagios and Prometheus belong to the pull group.
Those that fall into the push category expect data to come to them. They are passive, at least where metrics gathering is concerned. Each of the services that collect data is supposed to push it to one central location. Popular examples are collectd and StatsD. A pull system, on the other hand, is active: it scrapes data from all specified targets. Data collectors do not know about the existence of the database; their only purpose is to gather data and expose it through a protocol acceptable to the system that pulls it.
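The two flows can be sketched as follows. The StatsD line format (`name:value|type`) and the Prometheus text exposition format are real; the transport details are simplified (StatsD clients typically send over UDP, and Prometheus scrapes an HTTP endpoint), and the metric names are invented:

```python
# Push: the collector knows the storage address and sends data to it.
def statsd_line(name: str, value: int, metric_type: str = "c") -> str:
    """Format a StatsD metric line (e.g. a counter increment)."""
    return f"{name}:{value}|{metric_type}"

packet = statsd_line("my_service.requests", 1)  # -> "my_service.requests:1|c"
# A push client would now send `packet` over UDP to the StatsD address.

# Pull: the collector only exposes data and waits to be scraped.
def exposition(name: str, labels: dict, value: int) -> str:
    """Format a sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

page = exposition("requests_total", {"service": "my-service"}, 42)
# A pull target would serve `page` on an HTTP endpoint (e.g. /metrics)
# without ever knowing where the database lives.
```

The asymmetry is the point: the push client must know the storage address, while the pull target knows nothing about who consumes its data.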
A discussion about the pros and cons of each approach has been raging for quite some time. There are many arguments in favor of one system over the other, and we could spend a lot of time going through all of them. Instead, we'll focus on discovery, the argument that is, in my opinion, the most relevant.
With the push system, discovery is easy. All that data collectors need to know is the address of the metrics storage; they push data there. As long as that address remains operational, the configuration is very straightforward. With the pull system, the system needs to know the location of all the data collectors (or exporters). When there are only a few, that is easy to configure. If that number jumps to tens, hundreds, or even thousands of targets, the configuration becomes very tedious. That situation clearly favors the push model. But technology has changed. We now have reliable systems that provide service discovery. Docker Swarm, for example, has it baked into Docker Engine. Finding targets is easy and, assuming that we trust service discovery, we always have up-to-date information about all the data collectors.
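As a hedged illustration, newer Prometheus releases ship with a Docker Swarm service-discovery integration, so a scrape configuration can delegate target discovery entirely to the cluster. A minimal fragment (the job name is made up; whether it fits your setup depends on your Prometheus version and Docker socket location):

```yaml
scrape_configs:
  - job_name: swarm-tasks
    dockerswarm_sd_configs:
      - host: unix:///var/run/docker.sock
        role: tasks
```

With this in place, Prometheus asks Docker Swarm for the current list of tasks on every discovery refresh, so targets that are deployed, scaled, or moved are picked up without editing the configuration.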
With proper service discovery in place, the pull versus push debate becomes, more or less, irrelevant. That brings us to an argument that makes pull more appealing: it is much easier to discover a failed instance or a missing service when pulling data. When a system expects data collectors to push data, it is oblivious to whether something is missing. We can summarize the problem as "I don't know what I don't know." Pull systems, on the other hand, know what to expect. They know what their targets are, so it is easy to deduce that when a scraping target does not respond, the likely cause is that it stopped working.
Neither of the arguments for push or pull is definitive, and we should not make a choice based on that criterion alone. Instead, we'll explore the tools we discussed a bit more.
The first one on the list is Graphite.