Choosing a Solution for Metrics Storage and Query
Every cluster needs to collect metrics. They are the basis of any alerting system we might want to employ. Without information about the current and past state of a cluster, we would not be able to react to problems when they occur, nor would we be able to prevent them from happening in the first place. Actually, that is not entirely accurate. We could do both of those things, but not in a way that is efficient and scalable.
A good analogy is blindness. Being blind does not mean that we cannot feel our way through an environment. Similarly, we are not helpless without a way to collect and query metrics. We can SSH into each of the nodes and check the system manually. We can start by fiddling with top, mem, df, and other commands. We can check the status of the containers with the docker stats command. We can go from one container to another and check their logs. We can do all those things, but such an approach does not scale. We cannot increase the number of operators at the same rate as the number of servers. We cannot turn ourselves into machines, and even if we could, we would be terrible ones. That's why we have tools to help us. And, if they do not fulfill our needs, we can build our own solutions on top of them.
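To make the manual approach concrete, the snippet below sketches the kind of per-node inspection described above. It is an illustration, not a recommendation: in a real cluster we would have to SSH into every node and repeat these commands by hand, which is exactly what does not scale. The Docker check is guarded since not every node (or workstation) has Docker installed.

```shell
# Manual, per-node inspection -- the approach that does not scale.
# In practice, each of these would be run over SSH on every node.

uptime                # load averages: a first hint of CPU pressure
df -h /               # disk usage of the root filesystem

# Container-level resource usage, only if Docker is present on this node.
# --no-stream prints a single snapshot instead of refreshing continuously.
command -v docker >/dev/null && docker stats --no-stream
```

Multiply those few commands by dozens of nodes and hundreds of containers, and the need for a centralized metrics system becomes obvious.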
There are many tools to choose from. It would be impossible to compare them all, so we'll limit the scope to a handful.
We'll focus on open source projects only. Some of the tools we'll discuss have a paid enterprise offering in the form of additional features. We'll exclude them from the comparison. The reason behind the exclusion lies in my belief that we should always start with open source software, get comfortable with it, and only once it proves its worth, evaluate whether it is worthwhile switching to the enterprise version.
Moreover, we'll introduce one more limitation. We will explore only the solutions that we can host ourselves. That excludes hosted services like, for example, Scout (https://scoutapp.com/) or DataDog (https://www.datadoghq.com/). The reason behind such a decision is two-fold. Many organizations are not willing to "give" their data to a third-party hosted service. Even when there is no such restriction, a hosted service would need to be able to send alerts back into our system, and that would open a significant security hole. And even if neither of those concerns applies to you, hosted services are not flexible enough. None of the services I know of would give us enough flexibility to build a self-adapting and self-healing system. Besides, the purpose of this book is to provide free solutions, hence the insistence on open source software that you can host yourself.
That does not mean that paid software is not worth the price, nor that we should not use, and pay for, hosted services. Quite the contrary. However, I felt it would be better to start with things we can build ourselves and explore their limits. From there on, you will have a better understanding of what you need and whether paying for it is worthwhile.
You might be able to guess which tool will be chosen. Nevertheless, this chapter provides a more detailed explanation behind the choice. The overview that follows is important since it provides a short description of the types of solutions for storing and querying metrics, as well as the pros and cons of some of the tools on the market.