Kubernetes HA

Design docs for Kubernetes HA

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack Kubernetes HA design document

Affected version: 0.6.x

1. Goals

Provide a highly-available control-plane version of Kubernetes.

2. Cluster components

2.1 Load balancer

2.1.1 External

A Kubernetes HA cluster needs a single TCP load-balancer for communication from nodes to masters and from masters to masters (all internal communication has to go through the load-balancer).

Kubernetes HA - external LB

PROS:

  • standard solution

CONS:

  • it's not enough to create just one instance of such a load-balancer - it needs failover logic (like a virtual IP), so in the end a fully highly-available setup requires automation for a whole new service (see the keepalived sketch after this list)
  • requires additional dedicated virtual machines (at least 2 for HA), even in the case of a single-control-plane cluster
  • probably requires infrastructure that can handle a virtual IP (depending on the solution chosen for failover)
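
For illustration, the failover logic mentioned above is typically implemented by pairing the external load-balancer instances with keepalived managing a virtual IP. A minimal sketch - the interface name, router id and the 10.10.1.100 address are assumptions, not part of this design:

# /etc/keepalived/keepalived.conf - announces the virtual IP on the active
# load-balancer node; the standby node would use state BACKUP and a lower priority
vrrp_instance k8s_api {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.10.1.100/24
    }
}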

2.1.2 Internal

Following the idea from kubespray's HA mode, we can skip the creation of a dedicated external load-balancer (2.1.1).

Instead, we can create identical instances of a lightweight load-balancer (such as HAProxy) on each master and each kubelet node, so that every node reaches the Kubernetes API through its local balancer (see the kubeconfig fragment at the end of this subsection).

Kubernetes HA - internal LB

PROS:

  • no need to create dedicated load-balancer clusters with failover logic
  • since the internal load-balancer is replicated on every node, it is highly-available by definition

CONS:

  • increased network traffic
  • longer provisioning times, as any change in the load-balancer's config requires provisioning to touch every node in the cluster (masters and kubelet nodes)
  • debugging load-balancer issues may become slightly harder
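
For context, with this approach nothing in the cluster connects to a remote API address directly - every kubeconfig on every node points at the local balancer. A sketch of the relevant fragment, assuming the balancer listens on port 3446 as in the proposal in section 6:

# Fragment of a node's kubeconfig (e.g. /etc/kubernetes/kubelet.conf) when the
# cluster endpoint is "localhost:3446" - the local HAProxy forwards to the masters
clusters:
- cluster:
    certificate-authority-data: <redacted>
    server: https://localhost:3446
  name: kubernetes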

2.2 Etcd cluster

2.2.1 External

Kubernetes HA - external ETCD

PROS:

  • in the case of high network / system load, an external etcd cluster deployed on dedicated premium-quality virtual machines will behave more stably

CONS:

  • requires automation for the creation and distribution of etcd's server and client PKI certificates (see the configuration fragment after this list)
  • upgrading etcd is difficult and requires well-tested automation that works on multiple nodes at once in perfect coordination - when etcd's quorum fails, it is unable to auto-heal itself and has to be reconstructed from scratch (where data loss or discrepancy is likely)
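
For reference, pointing kubeadm at an external etcd cluster boils down to a fragment like the one below in ClusterConfiguration. This is only a sketch - the endpoints and certificate paths are assumptions and would have to match whatever the certificate automation produces:

# Fragment of kubeadm ClusterConfiguration for the external etcd variant
etcd:
  external:
    endpoints:
      - https://10.10.1.10:2379
      - https://10.10.1.11:2379
      - https://10.10.1.12:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key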

2.2.2 Internal

Kubernetes HA - internal ETCD

PROS:

  • adding / removing etcd nodes is completely automated and behaves as expected (via kubeadm)
  • etcd's PKI is automatically re-distributed when new masters join the control-plane

CONS:

  • etcd is deployed in containers alongside other internal components, which may impact its stability when system / network load is high (see the inspection command after this list)
  • since etcd is containerized, it may be prone to Docker-related issues
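
For context, in the internal (stacked) variant each master runs etcd as a static pod managed by the kubelet, so the members can be inspected with plain kubectl. A quick check, assuming the component=etcd label that kubeadm puts on its static pods and pod names following the hostnames (e.g. etcd-k1m1):

$ kubectl -n kube-system get pods -l component=etcd -o wide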

3. Legacy single-master solution

After the HA logic is implemented, it is probably better to reuse the new codebase also for single-master clusters.

In the case of using the internal load-balancer (2.1.2), it makes sense to use an HA cluster scaled down to a single node (with a single-backend load-balancer) and drop the legacy code.

4. Use cases

LambdaStack delivers highly-available Kubernetes clusters, deploying them across multiple availability zones / regions to increase the stability of production environments.

5. Example use

kind: lambdastack-cluster
title: "LambdaStack Cluster Config"
provider: any
name: "k8s1"
build_path: # Dynamically built
specification:
  name: k8s1
  admin_user:
    name: ubuntu
    key_path: id_ed25519
    path: # Dynamically built
  components:
    kubernetes_master:
      count: 3
      machines:
        - default-k8s-master1
        - default-k8s-master2
        - default-k8s-master3
    kubernetes_node:
      count: 2
      machines:
        - default-k8s-node1
        - default-k8s-node2
    logging:
      count: 0
    monitoring:
      count: 0
    kafka:
      count: 0
    postgresql:
      count: 0
    load_balancer:
      count: 0
    rabbitmq:
      count: 0
---
kind: infrastructure/machine
provider: any
name: default-k8s-master1
specification:
  hostname: k1m1
  ip: 10.10.1.148
---
kind: infrastructure/machine
provider: any
name: default-k8s-master2
specification:
  hostname: k1m2
  ip: 10.10.2.129
---
kind: infrastructure/machine
provider: any
name: default-k8s-master3
specification:
  hostname: k1m3
  ip: 10.10.3.16
---
kind: infrastructure/machine
provider: any
name: default-k8s-node1
specification:
  hostname: k1c1
  ip: 10.10.1.208
---
kind: infrastructure/machine
provider: any
name: default-k8s-node2
specification:
  hostname: k1c2
  ip: 10.10.2.168

6. Design proposal

As for the design proposal, the simplest solution is to take the internal load-balancer (2.1.2) and the internal etcd (2.2.2) and merge them together, then carefully observe and tune the network traffic coming from the HAProxy instances for a large number of worker nodes.

Kubernetes HA - internal LB

Example HAProxy config:

global
    log /dev/log local0
    log /dev/log local1 notice
    daemon

defaults
    log global
    retries 3
    maxconn 2000
    timeout connect 5s
    timeout client 120s
    timeout server 120s

frontend k8s
    mode tcp
    bind 0.0.0.0:3446
    default_backend k8s

backend k8s
    mode tcp
    balance roundrobin
    option tcp-check

    server k1m1 10.10.1.148:6443 check port 6443
    server k1m2 10.10.2.129:6443 check port 6443
    server k1m3 10.10.3.16:6443 check port 6443

Example ClusterConfiguration:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.14.6
controlPlaneEndpoint: "localhost:3446"
apiServer:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
    audit-log-maxbackup: "10"
    audit-log-maxsize: "200"
    audit-log-path: "/var/log/apiserver/audit.log"
    enable-admission-plugins: "AlwaysPullImages,DenyEscalatingExec,NamespaceLifecycle,ServiceAccount,NodeRestriction"
    profiling: "False"
controllerManager:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
    profiling: "False"
    terminated-pod-gc-threshold: "200"
scheduler:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
    profiling: "False"
networking:
  dnsDomain: cluster.local
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
certificatesDir: /etc/kubernetes/pki

To deploy the first master, run (Kubernetes 1.14):

$ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yml --experimental-upload-certs
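
After the first master is initialized, kubeadm prints follow-up steps; typically they amount to copying the admin kubeconfig and applying a CNI plugin whose pod CIDR matches the podSubnet above (10.244.0.0/16 is e.g. flannel's default). A sketch of those steps:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
# then apply the chosen CNI plugin's manifest, e.g. kubectl apply -f <cni-manifest.yml>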

To add one more master, run (Kubernetes 1.14):

$ sudo kubeadm join localhost:3446 \
         --token 932b4p.n6teb53a6pd1rinq \
         --discovery-token-ca-cert-hash sha256:bafb8972fe97c2ef84c6ac3efd86fdfd76207cab9439f2adbc4b53cd9b8860e6 \
         --experimental-control-plane --certificate-key f1d2de1e5316233c078198a610c117c65e4e45726150d63e68ff15915ea8574a
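
To verify that the new master has joined the control-plane (a quick check, not part of the original procedure):

$ kubectl get nodes -o wide
$ kubectl -n kube-system get pods -o wide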

To remove one master, run the following (it will properly clean up the config inside Kubernetes - do not use kubectl delete node):

$ sudo kubeadm reset --force

In later versions (Kubernetes 1.17) this feature became stable and the "experimental" prefix was removed from the command-line parameters.
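
For reference, the equivalent stable forms of the commands above look as follows (the token, discovery hash and certificate key are placeholders):

$ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yml --upload-certs

$ sudo kubeadm join localhost:3446 \
         --token <token> \
         --discovery-token-ca-cert-hash sha256:<hash> \
         --control-plane --certificate-key <certificate-key>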

7. Post-implementation erratum