Operating 100+ Kubernetes Clusters in Production. Postmortems on a Daily Basis!

When you run Kubernetes in production and at scale, you encounter many issues that challenge the reliability of your workloads as well as your development workflows. Some of these issues emerge over time as clusters grow in size and usage and the number of workloads increases; others appear only once you go global and move into regions with vastly different technology landscapes, such as China.

This talk goes into detail on lessons learned from concurrently operating 100+ production clusters for large enterprises, on different clouds as well as in on-premises data centers around the globe. Over the years we have worked through hundreds of postmortems and want to share both operations and development best practices that can help avoid the issues we ran into. A focus of this talk is getting to a hardened, reliable, and easily upgradable cluster setup.


User-level knowledge of Kubernetes, Docker, and networking.


Provide an overview of the day-to-day work that lets Giant Swarm run 100+ clusters in production. Useful for people who operate clusters on their own.



From 8:30: Registration and welcome coffee

9:30: Start


Machine Learning

  • What is machine learning?
  • The typical ML workflow
  • What are neural networks?
  • Jupyter Lab with Python
  • An introduction to TensorFlow
  • Keras as a high-level API for TensorFlow

Hands-on: Deep learning models with Keras

  • Data generators
  • Exploratory analysis of datasets
  • Hold-out vs. cross-validation

11:00 - 11:15: Coffee break

Hands-on: Deep learning models with Keras

  • Feed-forward network architecture
  • Convolutional neural networks as a deep learning approach
  • Evaluating and visualizing the model

12:30 - 13:30: Lunch break

Pipelines with Luigi

  • Requirements for production models
  • Overview of Luigi and its modules
  • Building an example workflow

Hands-on: Implementing the Keras workflow with Luigi

  • Requirements for production models
  • Overview of Luigi and its modules
  • Building an example workflow

15:30 - 15:45: Coffee break

Hands-on: TensorFlow Serving

  • Overview of TensorFlow Serving
  • Configuring loading strategies
  • Deploying the model

approx. 17:00: End




Roman Sokolkov is a Site Reliability Engineer at Giant Swarm with more than 9 years of experience in the infrastructure field. Prior to Giant Swarm, Roman built and operated OpenStack clouds across the globe with Mirantis.




