When you run Kubernetes in production and at scale, you encounter many issues that challenge the reliability of your workloads and your development workflows. Some of these issues emerge over time as usage, cluster size, and the number of workloads grow; others only appear once you go global and expand into regions with vastly different technology landscapes, such as China.
This talk goes into detail on lessons learned from concurrently operating 100+ clusters in production for large enterprises, on different clouds as well as in on-premises data centers around the globe. Over the years we have worked through hundreds of postmortems and want to share both operations and development best practices that can help you avoid the issues we ran into. A focus of this talk is getting to a hardened, reliable, and easily upgradable cluster setup.
User-level knowledge of Kubernetes, Docker, and networking.
Provide an overview of the day-to-day work that lets Giant Swarm run 100+ clusters in production. Useful for people who operate clusters on their own.
Roman is a Site Reliability Engineer at Giant Swarm with more than 9 years of experience in the infrastructure field. Prior to Giant Swarm, Roman built and operated OpenStack clouds across the globe with Mirantis.