Application health checking and probing have existed since the dawn of computer science. Usually seen as a trivial task, health checking becomes more involved when applied to distributed cloud-native apps.
In this talk we will explore the challenges and perils of modern health checking, give an overview of how modern distributed systems (such as AWS, Apache Mesos, and Kubernetes) tackle the problem, and share practical recommendations based on our experience revamping Apache Mesos' health checking.
What people usually understand by "health checks" is a simple sequence: performing a specific action and judging whether the target application is healthy based on the outcome. This simple sequence becomes trickier when the application consists of multiple containers managed by a cluster orchestrator and monitored by third-party tooling. Here are some of the questions that arise:
* What entity should interpret the result? Should the reasoning about the health of a task be done locally (less context) or globally (greater overhead)?
* How often should health status be delivered to balance excessive network overhead against an up-to-date status?
* Should health checks be aware of environment-specific intricacies such as namespaces and software-defined networks?
* How can the overhead imposed by health checks be kept manageable?
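To make the first two questions concrete, here is a minimal sketch of local interpretation: the checker runs an action, decides health itself, and reports only status changes rather than every raw result. The names `HealthTracker`, `tcp_check`, and `max_consecutive_failures` are hypothetical and chosen for illustration; this is not how Mesos, Kubernetes, or AWS implement health checking.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Optional


def tcp_check(host: str, port: int, timeout: float = 1.0) -> Callable[[], bool]:
    """Build a check that succeeds if a TCP connection can be opened."""
    def check() -> bool:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    return check


@dataclass
class HealthTracker:
    """Interprets check outcomes locally and reports only status changes."""
    check: Callable[[], bool]          # the "specific action" to perform
    max_consecutive_failures: int = 3  # assumed threshold, for illustration
    healthy: bool = True
    failures: int = 0

    def run_once(self) -> Optional[str]:
        """Run one check; return a status update only if the status flipped."""
        try:
            ok = self.check()
        except Exception:
            ok = False  # any error while checking counts as a failed check
        self.failures = 0 if ok else self.failures + 1
        now_healthy = self.failures < self.max_consecutive_failures
        if now_healthy != self.healthy:
            self.healthy = now_healthy
            return "healthy" if now_healthy else "unhealthy"
        return None  # nothing new to send over the network
```

With a threshold of three, a run of six checks that fails four times in a row would send two updates instead of six. The trade-off is that whoever receives those updates sees less context than the raw results would give.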
Container basics (e.g. Docker), cluster orchestrator basics (e.g. Mesos, Kubernetes).
Attendees will learn what problems arise when health checking a distributed containerized application. Beyond the domain-specific and technical details, this talk will hopefully also give folks food for thought about how we can design more reliable and efficient distributed systems.
Alex is an Apache committer and Mesos PMC member at Mesosphere. He loves making programs run faster, reducing the cognitive load of code, and dealing with the right abstractions. In a previous life Alex was segmenting medical images and investigating the behaviour of human vessels in several German research institutes. His areas of interest include distributed systems, object recognition, and probabilistic and heuristic algorithms.