
Design Docs

This section of the LambdaStack documentation contains the design documents that are in use or planned for use.

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

1 - ARM

Design docs for ARM processor development

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack ARM design document

Affected version: 1.1.x

Goals

This document outlines an approach to add (partial) ARM support to LambdaStack. The requirements are:

  • ARMv8/ARM64 architecture
  • Centos 7
  • "any" provider as we do not want to provide ARM infrastructure on any cloud providers yet through Terraform
  • LambdaStack components needed ordered by priority:
    1. Kubernetes
    2. Kafka
    3. HAProxy
    4. Keycloak (This is the only deployment we need to support from the applications role)
    5. PostgreSQL (would only be used by Keycloak and only needs to support a single deployment)
    6. RabbitMQ
    7. Logging (ELK + filebeat)
    8. Monitoring (Prometheus + Grafana + Exporters)

Initial research here shows additional information about available packages and affected roles for each component.

Approach

The two high-level approaches that have been proposed so far:

  1. Add “architecture” flag when using LambdaStack
  2. Add a new OS (e.g. CentosARM64)

These approaches have the following disadvantages from the start:

  1. Requires an additional input, which makes things more confusing, as users will need to supply not only the OS but also the architecture for an (offline) install. This should not be needed, as we can detect the architecture we are working on at all required levels.
  2. Does not require additional input, but leads to code duplication in the repository role, as we would then need to maintain download-requirements.sh for each OS and architecture combination.

That is why I opt for an approach where we add neither an architecture flag nor an additional OS. The architecture can be handled at the code level; at the OS level only the requirements.txt might be different for each architecture, as indicated by the initial research here.

Changes required

Repository role

In the repository role we need to change how the requirements are downloaded to support additional architectures, as the download requirements might differ:

  • Some components/roles might not have packages/binaries/containers that support ARM
  • Some filenames for binaries will be different per architecture
  • Some package repositories will have different URLs per architecture

Hence we should make a requirements.txt for each architecture we want to support, for example:

  • requirements_x86_64.txt (Should be the default and present)
  • requirements_arm64.txt

The download-requirements.sh script should be able to figure out which one to select based on the output of:

uname -i
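A minimal sketch of how download-requirements.sh could pick the right file; the case mapping and the use of uname -m as a fallback are assumptions made only for illustration:

```bash
#!/usr/bin/env bash
# Sketch only: pick the requirements file matching the machine architecture.
# File names follow the proposal above; the mapping itself is an assumption.
arch="$(uname -m)"   # 'uname -i' can return 'unknown' on some systems, so 'uname -m' is used here

case "$arch" in
  x86_64)          req_file="requirements_x86_64.txt" ;;
  aarch64|arm64)   req_file="requirements_arm64.txt" ;;
  *) echo "Unsupported architecture: ${arch}" >&2; exit 1 ;;
esac

echo "Using ${req_file}"
```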

Download role

In the download role, which is used to download plain files from the repository, we should add support for filename patterns and automatically look for the current architecture (optionally with a regex-based suffix like linux[_-]amd64\.(tar\.gz|tar|zip)):

For example select between:

  • haproxy_exporter-0.12.0.linux-x86_64.tar.gz
  • haproxy_exporter-0.12.0.linux-arm64.tar.gz

based on the ansible_architecture fact.

Note that this should be optional, as some filenames do not contain an architecture, for example Java-based packages.
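As a rough illustration (the suffix mapping below is an assumption and varies per project, which is exactly why the optional regex pattern is proposed):

```bash
# Sketch: build the per-architecture file name from the haproxy_exporter example above.
# The suffix mapping is an assumption; some projects use "amd64" instead of "x86_64".
case "$(uname -m)" in
  x86_64)  suffix="x86_64" ;;
  aarch64) suffix="arm64"  ;;
esac

file="haproxy_exporter-0.12.0.linux-${suffix}.tar.gz"
# Arch-independent artifacts (e.g. Java jars) skip the pattern and keep a single name.
```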

Architecture support for each component/role

As per the current requirements, not every LambdaStack component is required to support ARM, and there might be cases where a component/role cannot support ARM, as indicated by the initial research here.

That is why every component/role should be marked with the architectures it supports. Maybe something in <rolename>/defaults/main.yml like:

supported_architectures:
  - all ?
  - x86_64
  - arm64

We can assume the role/component will support everything if all is defined or if supported_architectures is not present.

Pre-flight check

The preflight should be expanded to check if all the components/roles we want to install from the inventory actually support the architecture we want to use. We should be able to do this with the definition from the above point. This way we will make sure people can only install components on ARM which we actually support.
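A rough shell sketch of the idea; the real pre-flight check would live in the lambdastack codebase, and the role names and grep-based YAML lookup below are assumptions made only for illustration:

```bash
#!/usr/bin/env bash
# Rough illustration only; the real pre-flight would be part of the lambdastack tooling.
# Assumes roles keep supported_architectures in <rolename>/defaults/main.yml.
arch="$(uname -m | sed 's/aarch64/arm64/')"

for role in kubernetes_master kafka haproxy; do        # role names taken from the inventory (assumed)
  defaults="roles/${role}/defaults/main.yml"
  # A missing file or missing supported_architectures key means "supports everything".
  if [ -f "$defaults" ] && grep -q "supported_architectures" "$defaults"; then
    if ! grep -Eq "^[[:space:]]*-[[:space:]]*(all|${arch})[[:space:]]*$" "$defaults"; then
      echo "Role ${role} does not support architecture ${arch}" >&2
      exit 1
    fi
  fi
done
echo "All selected roles support ${arch}"
```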

Replace Skopeo with Crane

Currently we use Skopeo to download the image requirements. Skopeo however has the following issues with newer versions:

  • No more support for universal Go binaries; each OS would need its own build version
  • Sketchy support for ARM64

That is why we should replace it with Crane.

  1. This tool can do the same as Skopeo (see the verification example after this list):

./skopeo --insecure-policy copy docker://kubernetesui/dashboard:v2.3.1 docker-archive:skopeodashboard:v2.3.1
./crane pull --insecure kubernetesui/dashboard:v2.3.1 dashboard.tar

The above will produce the same Docker image package.

  2. Supports a universal cross-distro binary.
  3. Has support for both ARM64 and x86_64.
  4. Has official pre-built binaries, unlike Skopeo.
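As a quick sanity check that the archive produced by Crane is usable, it can be loaded into a local Docker daemon (assuming one is available; illustration only):

```bash
# Pull the image to a tarball with Crane and verify that it loads into a local Docker daemon.
./crane pull --insecure kubernetesui/dashboard:v2.3.1 dashboard.tar
docker load -i dashboard.tar
```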

1.1 - CentOS ARM Analysis

Design docs for CentOS ARM processor development

CentOS requirements.txt ARM analysis

Packages

Name ARM Supported Info Required
apr + +
apr-util + +
centos-logos + ?
createrepo + +
deltarpm + +
httpd + +
httpd-tools + +
libxml2-python + +
mailcap + +
mod_ssl + +
python-chardet + +
python-deltarpm + +
python-kitchen + +
yum-utils + +
audit + +
bash-completion + +
c-ares + ---
ca-certificates + +
cifs-utils + +
conntrack-tools + +
containerd.io + +
container-selinux + ?
cri-tools-1.13.0 + ?
curl + +
dejavu-sans-fonts + +
docker-ce-19.03.14 + +
docker-ce-cli-19.03.14 + +
ebtables + +
elasticsearch-curator-5.8.3 --- elasticsearch-curator-3.5.1 (from separate repo v3) +
elasticsearch-oss-7.9.1 + +
erlang-23.1.4 + +
ethtool + +
filebeat-7.9.2 + +
firewalld + +
fontconfig + +
fping + +
gnutls + +
grafana-7.3.5 + +
gssproxy + +
htop + +
iftop + +
ipset + +
java-1.8.0-openjdk-headless + +
javapackages-tools + +
jq + +
libini_config + +
libselinux-python + +
libsemanage-python + +
libX11 + +
libxcb + +
libXcursor + +
libXt + +
logrotate + +
logstash-oss-7.8.1 + +
net-tools + +
nfs-utils + +
nmap-ncat + ?
opendistro-alerting-1.10.1* + +
opendistro-index-management-1.10.1* + +
opendistro-job-scheduler-1.10.1* + +
opendistro-performance-analyzer-1.10.1* + +
opendistro-security-1.10.1* + +
opendistro-sql-1.10.1* + +
opendistroforelasticsearch-kibana-1.10.1* --- opendistroforelasticsearch-kibana-1.13.0 +
openssl + +
perl + +
perl-Getopt-Long + +
perl-libs + +
perl-Pod-Perldoc + +
perl-Pod-Simple + +
perl-Pod-Usage + +
pgaudit12_10 + ---
pgbouncer-1.10.* --- ---
pyldb + +
python-firewall + +
python-kitchen + +
python-lxml + +
python-psycopg2 + +
python-setuptools + ?
python-slip-dbus + +
python-ipaddress + ?
python-backports + ?
quota + ?
rabbitmq-server-3.8.9 + +
rh-haproxy18 --- ---
rh-haproxy18-haproxy-syspaths --- ---
postgresql10-server + +
repmgr10-4.0.6 --- ---
samba-client + +
samba-client-libs + +
samba-common + +
samba-libs + +
sysstat + +
tar + +
telnet + +
tmux + +
urw-base35-fonts + +
unzip + +
vim-common + +
vim-enhanced + +
wget + +
xorg-x11-font-utils + +
xorg-x11-server-utils + +
yum-plugin-versionlock + +
yum-utils + +
rsync + +
kubeadm-1.18.6 + +
kubectl-1.18.6 + +
kubelet-1.18.6 + +
kubernetes-cni-0.8.6-0 + +

Files

Name ARM Supported Info Required
https://github.com/prometheus/haproxy_exporter/releases/download/v0.10.0/haproxy_exporter-0.10.0.linux-arm64.tar.gz + dedicated package +
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.14.0/jmx_prometheus_javaagent-0.14.0.jar + jar +
https://archive.apache.org/dist/kafka/2.6.0/kafka_2.12-2.6.0.tgz + shell scripts + jar libraries +
https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/prometheus/releases/download/v2.10.0/prometheus-2.10.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/alertmanager/releases/download/v0.17.0/alertmanager-0.17.0.linux-arm64.tar.gz + dedicated package +
https://archive.apache.org/dist/zookeeper/zookeeper-3.5.8/apache-zookeeper-3.5.8-bin.tar.gz + shell scripts + jar libraries ---
https://archive.apache.org/dist/ignite/2.9.1/apache-ignite-2.9.1-bin.zip + shell scripts + jar libraries ---
https://releases.hashicorp.com/vault/1.7.0/vault_1.7.0_linux_arm64.zip + dedicated package ---
https://get.helm.sh/helm-v3.2.0-linux-arm64.tar.gz + dedicated package ---
https://github.com/hashicorp/vault-helm/archive/v0.9.0.tar.gz + yaml files ---
https://github.com/wrouesnel/postgres_exporter/releases/download/v0.8.0/postgres_exporter_v0.8.0_linux-amd64.tar.gz --- +
https://charts.bitnami.com/bitnami/node-exporter-1.1.2.tgz + yaml files +
https://helm.elastic.co/helm/filebeat/filebeat-7.9.2.tgz + yaml files +

Images

Name ARM Supported Info Required
haproxy:2.2.2-alpine + arm64v8/haproxy +
kubernetesui/dashboard:v2.3.1 + +
kubernetesui/metrics-scraper:v1.0.7 + +
registry:2 +
hashicorp/vault-k8s:0.7.0 --- https://hub.docker.com/r/moikot/vault-k8s / custom build ---
vault:1.7.0 + ---
apacheignite/ignite:2.9.1 --- https://github.com/apache/ignite/tree/master/docker/apache-ignite / custom build ---
bitnami/pgpool:4.1.1-debian-10-r29 --- ---
brainsam/pgbouncer:1.12 --- ---
istio/pilot:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/proxyv2:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/operator:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
jboss/keycloak:4.8.3.Final --- +
jboss/keycloak:9.0.0 --- +
rabbitmq:3.8.9 + +
coredns/coredns:1.5.0 + +
quay.io/coreos/flannel:v0.11.0 + +
calico/cni:v3.8.1 + +
calico/kube-controllers:v3.8.1 + +
calico/node:v3.8.1 + +
calico/pod2daemon-flexvol:v3.8.1 + +
k8s.gcr.io/kube-apiserver:v1.18.6 + k8s.gcr.io/kube-apiserver-arm64:v1.18.6 +
k8s.gcr.io/kube-controller-manager:v1.18.6 + k8s.gcr.io/kube-controller-manager-arm64:v1.18.6 +
k8s.gcr.io/kube-scheduler:v1.18.6 + k8s.gcr.io/kube-scheduler-arm64:v1.18.6 +
k8s.gcr.io/kube-proxy:v1.18.6 + k8s.gcr.io/kube-proxy-arm64:v1.18.6 +
k8s.gcr.io/coredns:1.6.7 --- coredns/coredns:1.6.7 +
k8s.gcr.io/etcd:3.4.3-0 + k8s.gcr.io/etcd-arm64:3.4.3-0 +
k8s.gcr.io/pause:3.2 + k8s.gcr.io/pause-arm64:3.2 +

Custom builds

Build multi arch image for Keycloak 9:

Clone repo: https://github.com/keycloak/keycloak-containers/

Checkout tag: 9.0.0

Change dir to: keycloak-containers/server

Create new builder: docker buildx create --name mybuilder

Switch to builder: docker buildx use mybuilder

Inspect builder and make sure it supports linux/amd64, linux/arm64: docker buildx inspect --bootstrap

Build and push container: docker buildx build --platform linux/amd64,linux/arm64 -t repo/keycloak:9.0.0 --push .


Additional info:

https://hub.docker.com/r/jboss/keycloak/dockerfile

https://github.com/keycloak/keycloak-containers/

https://catalog.redhat.com/software/containers/ubi8/ubi-minimal/5c359a62bed8bd75a2c3fba8?architecture=arm64&container-tabs=overview

https://docs.docker.com/docker-for-mac/multi-arch/

Components to roles mapping

  • Repository: repository, image-registry, node-exporter, firewall, filebeat, docker
  • Kubernetes: kubernetes-master, kubernetes-node, applications, node-exporter, haproxy_runc, kubernetes_common
  • Kafka: zookeeper, jmx-exporter, kafka, kafka-exporter, node-exporter
  • ELK (Logging): logging, elasticsearch, elasticsearch_curator, logstash, kibana, node-exporter
  • Exporters: node-exporter, kafka-exporter, jmx-exporter, haproxy-exporter, postgres-exporter
  • PostgreSQL: postgresql, postgres-exporter, node-exporter
  • Keycloak: applications
  • RabbitMQ: rabbitmq, node-exporter
  • HAProxy: haproxy, haproxy-exporter, node-exporter, haproxy_runc
  • Monitoring: prometheus, grafana, node-exporter

In addition to the mapping above, the following roles also need to be checked for each component:

  • upgrade
  • backup
  • download
  • firewall
  • filebeat
  • recovery (n/a kubernetes)

1.2 - RedHat ARM Analysis

Design docs for RedHat ARM processor development

RedHat requirements.txt ARM analysis

Packages

Name ARM Supported Info Required
apr + +
apr-util + +
redhat-logos + ?
createrepo + +
deltarpm + +
httpd + +
httpd-tools + +
libxml2-python + +
mailcap + +
mod_ssl + +
python-chardet + +
python-deltarpm + +
python-kitchen + +
yum-utils + +
audit + +
bash-completion + +
c-ares + ---
ca-certificates + +
cifs-utils + +
conntrack-tools + +
containerd.io + +
container-selinux + ?
cri-tools-1.13.0 + ?
curl + +
dejavu-sans-fonts + +
docker-ce-19.03.14 + +
docker-ce-cli-19.03.14 + +
ebtables + +
elasticsearch-curator-5.8.3 --- elasticsearch-curator-3.5.1 (from separate repo v3) +
elasticsearch-oss-7.10.2 + +
ethtool + +
filebeat-7.9.2 + +
firewalld + +
fontconfig + +
fping + +
gnutls + +
grafana-7.3.5 + +
gssproxy + +
htop + +
iftop + +
ipset + +
java-1.8.0-openjdk-headless + +
javapackages-tools + +
jq + +
libini_config + +
libselinux-python + +
libsemanage-python + +
libX11 + +
libxcb + +
libXcursor + +
libXt + +
logrotate + +
logstash-oss-7.8.1 + +
net-tools + +
nfs-utils + +
nmap-ncat + ?
opendistro-alerting-1.13.1* + +
opendistro-index-management-1.13.1* + +
opendistro-job-scheduler-1.13.1* + +
opendistro-performance-analyzer-1.13.1* + +
opendistro-security-1.13.1* + +
opendistro-sql-1.13.1* + +
opendistroforelasticsearch-kibana-1.13.1* + +
unixODBC + +
openssl + +
perl + +
perl-Getopt-Long + +
perl-libs + +
perl-Pod-Perldoc + +
perl-Pod-Simple + +
perl-Pod-Usage + +
pgaudit12_10 ? ---
pgbouncer-1.10.* ? ---
policycoreutils-python + +
pyldb + +
python-cffi + +
python-firewall + +
python-kitchen + +
python-lxml + +
python-psycopg2 + +
python-pycparser + +
python-setuptools + ?
python-slip-dbus + +
python-ipaddress + ?
python-backports + ?
quota + ?
rabbitmq-server-3.8.9 + +
rh-haproxy18 --- ---
rh-haproxy18-haproxy-syspaths --- ---
postgresql10-server + +
repmgr10-4.0.6 --- ---
samba-client + +
samba-client-libs + +
samba-common + +
samba-libs + +
sysstat + +
tar + +
telnet + +
tmux + +
urw-base35-fonts ? Need to be verified, no package found +
unzip + +
vim-common + +
vim-enhanced + +
wget + +
xorg-x11-font-utils + +
xorg-x11-server-utils + +
yum-plugin-versionlock + +
yum-utils + +
rsync + +
kubeadm-1.18.6 + +
kubectl-1.18.6 + +
kubelet-1.18.6 + +
kubernetes-cni-0.8.6-0 + +

Files

Name ARM Supported Info Required
https://packages.erlang-solutions.com/erlang/rpm/centos/7/aarch64/esl-erlang_23.1.5-1~centos~7_arm64.rpm + dedicated package +
https://github.com/prometheus/haproxy_exporter/releases/download/v0.10.0/haproxy_exporter-0.10.0.linux-arm64.tar.gz + dedicated package +
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.14.0/jmx_prometheus_javaagent-0.14.0.jar + jar +
https://archive.apache.org/dist/kafka/2.6.0/kafka_2.12-2.6.0.tgz + shell scripts + jar libraries +
https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/prometheus/releases/download/v2.10.0/prometheus-2.10.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/alertmanager/releases/download/v0.17.0/alertmanager-0.17.0.linux-arm64.tar.gz + dedicated package +
https://archive.apache.org/dist/zookeeper/zookeeper-3.5.8/apache-zookeeper-3.5.8-bin.tar.gz + shell scripts + jar libraries ---
https://archive.apache.org/dist/ignite/2.9.1/apache-ignite-2.9.1-bin.zip + shell scripts + jar libraries ---
https://releases.hashicorp.com/vault/1.7.0/vault_1.7.0_linux_arm64.zip + dedicated package ---
https://get.helm.sh/helm-v3.2.0-linux-arm64.tar.gz + dedicated package ---
https://github.com/hashicorp/vault-helm/archive/v0.9.0.tar.gz + yaml files ---
https://github.com/prometheus-community/postgres_exporter/releases/download/v0.9.0/postgres_exporter-0.9.0.linux-arm64.tar.gz --- +
https://charts.bitnami.com/bitnami/node-exporter-1.1.2.tgz + yaml files +
https://helm.elastic.co/helm/filebeat/filebeat-7.9.2.tgz + yaml files +

Images

Name ARM Supported Info Required
haproxy:2.2.2-alpine + arm64v8/haproxy +
kubernetesui/dashboard:v2.3.1 + +
kubernetesui/metrics-scraper:v1.0.7 + +
registry:2 +
hashicorp/vault-k8s:0.7.0 --- https://hub.docker.com/r/moikot/vault-k8s / custom build ---
vault:1.7.0 + ---
lambdastack/keycloak:9.0.0 + custom build +
bitnami/pgpool:4.1.1-debian-10-r29 --- ---
brainsam/pgbouncer:1.12 --- ---
istio/pilot:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/proxyv2:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/operator:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
jboss/keycloak:4.8.3.Final --- ---
jboss/keycloak:9.0.0 --- ---
rabbitmq:3.8.9 --- ---
coredns/coredns:1.5.0 + +
quay.io/coreos/flannel:v0.11.0 + +
calico/cni:v3.8.1 + +
calico/kube-controllers:v3.8.1 + +
calico/node:v3.8.1 + +
calico/pod2daemon-flexvol:v3.8.1 + +
k8s.gcr.io/kube-apiserver:v1.18.6 + k8s.gcr.io/kube-apiserver-arm64:v1.18.6 +
k8s.gcr.io/kube-controller-manager:v1.18.6 + k8s.gcr.io/kube-controller-manager-arm64:v1.18.6 +
k8s.gcr.io/kube-scheduler:v1.18.6 + k8s.gcr.io/kube-scheduler-arm64:v1.18.6 +
k8s.gcr.io/kube-proxy:v1.18.6 + k8s.gcr.io/kube-proxy-arm64:v1.18.6 +
k8s.gcr.io/coredns:1.6.7 --- coredns/coredns:1.6.7 +
k8s.gcr.io/etcd:3.4.3-0 + k8s.gcr.io/etcd-arm64:3.4.3-0 +
k8s.gcr.io/pause:3.2 + k8s.gcr.io/pause-arm64:3.2 +

Custom builds

Build multi arch image for Keycloak 9:

Clone repo: https://github.com/keycloak/keycloak-containers/

Checkout tag: 9.0.0

Change dir to: keycloak-containers/server

Create new builder: docker buildx create --name mybuilder

Switch to builder: docker buildx use mybuilder

Inspect builder and make sure it supports linux/amd64, linux/arm64: docker buildx inspect --bootstrap

Build and push container: docker buildx build --platform linux/amd64,linux/arm64 -t repo/keycloak:9.0.0 --push .


Additional info:

https://hub.docker.com/r/jboss/keycloak/dockerfile

https://github.com/keycloak/keycloak-containers/

https://catalog.redhat.com/software/containers/ubi8/ubi-minimal/5c359a62bed8bd75a2c3fba8?architecture=arm64&container-tabs=overview

https://docs.docker.com/docker-for-mac/multi-arch/

Components to roles mapping

  • Repository: repository, image-registry, node-exporter, firewall, filebeat, docker
  • Kubernetes: kubernetes-master, kubernetes-node, applications, node-exporter, haproxy_runc, kubernetes_common
  • Kafka: zookeeper, jmx-exporter, kafka, kafka-exporter, node-exporter
  • ELK (Logging): logging, elasticsearch, elasticsearch_curator, logstash, kibana, node-exporter
  • Exporters: node-exporter, kafka-exporter, jmx-exporter, haproxy-exporter, postgres-exporter
  • PostgreSQL: postgresql, postgres-exporter, node-exporter
  • Keycloak: applications
  • RabbitMQ: rabbitmq, node-exporter
  • HAProxy: haproxy, haproxy-exporter, node-exporter, haproxy_runc
  • Monitoring: prometheus, grafana, node-exporter

In addition to the mapping above, the following roles also need to be checked for each component:

  • backup
  • recovery (n/a kubernetes)

Known issues:

  • The PostgreSQL repository needs to be verified: "https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7Server-aarch64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found"
  • Additional repositories need to be enabled: "rhel-7-for-arm-64-extras-rhui-rpms" and "rhel-7-for-arm-64-rhui-rpms" (see the example after this list)
  • No package found for urw-base35-fonts
  • Only RHEL-7.6 and 8.x images are available for AWS
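The additional repositories from the list above can be enabled with yum-config-manager, which ships with yum-utils (shown only as an illustration):

```bash
# Enable the additional ARM repositories mentioned above (requires yum-utils).
yum-config-manager --enable rhel-7-for-arm-64-rhui-rpms
yum-config-manager --enable rhel-7-for-arm-64-extras-rhui-rpms
yum repolist enabled | grep arm-64
```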

1.3 - Ubuntu ARM Analysis

Design docs for Ubuntu ARM processor development

Ubuntu requirements.txt ARM analysis

Packages

Name ARM Supported Info Required
adduser + +
apt-transport-https + +
auditd + +
bash-completion + +
build-essential + +
ca-certificates + +
cifs-utils + +
containerd.io + +
cri-tools + +
curl + +
docker-ce + +
docker-ce-cli + +
ebtables + +
elasticsearch-curator + +
elasticsearch-oss + +
erlang-asn1 + +
erlang-base + +
erlang-crypto + +
erlang-eldap + +
erlang-ftp + +
erlang-inets + +
erlang-mnesia + +
erlang-os-mon + +
erlang-parsetools + +
erlang-public-key + +
erlang-runtime-tools + +
erlang-snmp + +
erlang-ssl + +
erlang-syntax-tools + +
erlang-tftp + +
erlang-tools + +
erlang-xmerl + +
ethtool + +
filebeat + +
firewalld + +
fping + +
gnupg2 + +
grafana + +
haproxy + +
htop + +
iftop + +
jq + +
libfontconfig1 + +
logrotate + +
logstash-oss + +
netcat + +
net-tools + +
nfs-common + +
opendistro-alerting + +
opendistro-index-management + +
opendistro-job-scheduler + +
opendistro-performance-analyzer + +
opendistro-security + +
opendistro-sql + +
opendistroforelasticsearch-kibana + +
openjdk-8-jre-headless + +
openssl + +
postgresql-10 + +
python-pip + +
python-psycopg2 + +
python-selinux + +
python-setuptools + +
rabbitmq-server + +
smbclient + +
samba-common + +
smbclient + +
software-properties-common + +
sshpass + +
sysstat + +
tar + +
telnet + +
tmux + +
unzip + +
vim + +
rsync + +
libcurl4 + +
libnss3 + +
libcups2 + +
libavahi-client3 + +
libavahi-common3 + +
libjpeg8 + +
libfontconfig1 + +
libxtst6 + +
fontconfig-config + +
python-apt + +
python + +
python2.7 + +
python-minimal + +
python2.7-minimal + +
gcc + +
gcc-7 + +
g++ + +
g++-7 + +
dpkg-dev + +
libc6-dev + +
cpp + +
cpp-7 + +
libgcc-7-dev + +
binutils + +
gcc-8-base + +
libodbc1 + +
apache2 + +
apache2-bin + +
apache2-utils + +
libjq1 + +
gnupg + +
gpg + +
gpg-agent + +
smbclient + +
samba-libs + +
libsmbclient + +
postgresql-client-10 + +
postgresql-10-pgaudit + +
postgresql-10-repmgr + +
postgresql-common + +
pgbouncer + +
ipset + +
libipset3 + +
python3-decorator + +
python3-selinux + +
python3-slip + +
python3-slip-dbus + +
libpq5 + +
python3-psycopg2 + +
python3-jmespath + +
libpython3.6 + +
python-cryptography + +
python-asn1crypto + +
python-cffi-backend + +
python-enum34 + +
python-idna + +
python-ipaddress + +
python-six + +
kubeadm + +
kubectl + +
kubelet + +
kubernetes-cni + +

Files

Name ARM Supported Info Required
https://github.com/prometheus/haproxy_exporter/releases/download/v0.10.0/haproxy_exporter-0.10.0.linux-arm64.tar.gz + dedicated package +
https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.14.0/jmx_prometheus_javaagent-0.14.0.jar + jar +
https://archive.apache.org/dist/kafka/2.6.0/kafka_2.12-2.6.0.tgz + shell scripts + jar libraries +
https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/prometheus/releases/download/v2.10.0/prometheus-2.10.0.linux-arm64.tar.gz + dedicated package +
https://github.com/prometheus/alertmanager/releases/download/v0.17.0/alertmanager-0.17.0.linux-arm64.tar.gz + dedicated package +
https://archive.apache.org/dist/zookeeper/zookeeper-3.5.8/apache-zookeeper-3.5.8-bin.tar.gz + shell scripts + jar libraries ---
https://archive.apache.org/dist/ignite/2.9.1/apache-ignite-2.9.1-bin.zip + shell scripts + jar libraries ---
https://releases.hashicorp.com/vault/1.7.0/vault_1.7.0_linux_arm64.zip + dedicated package ---
https://get.helm.sh/helm-v3.2.0-linux-arm64.tar.gz + dedicated package ---
https://github.com/hashicorp/vault-helm/archive/v0.9.0.tar.gz + yaml files ---
https://github.com/wrouesnel/postgres_exporter/releases/download/v0.8.0/postgres_exporter_v0.8.0_linux-amd64.tar.gz --- +
https://charts.bitnami.com/bitnami/node-exporter-1.1.2.tgz + yaml files +
https://helm.elastic.co/helm/filebeat/filebeat-7.9.2.tgz + yaml files +

Images

Name ARM Supported Info Required
haproxy:2.2.2-alpine + arm64v8/haproxy +
kubernetesui/dashboard:v2.3.1 + +
kubernetesui/metrics-scraper:v1.0.7 + +
registry:2 +
hashicorp/vault-k8s:0.7.0 --- https://hub.docker.com/r/moikot/vault-k8s / custom build ---
vault:1.7.0 + ---
apacheignite/ignite:2.9.1 --- https://github.com/apache/ignite/tree/master/docker/apache-ignite / custom build ---
bitnami/pgpool:4.1.1-debian-10-r29 --- ---
brainsam/pgbouncer:1.12 --- ---
istio/pilot:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/proxyv2:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
istio/operator:1.8.1 --- https://github.com/istio/istio/issues/21094 / custom build ---
jboss/keycloak:4.8.3.Final --- +
jboss/keycloak:9.0.0 --- +
rabbitmq:3.8.9 + +
coredns/coredns:1.5.0 + +
quay.io/coreos/flannel:v0.11.0 + +
calico/cni:v3.8.1 + +
calico/kube-controllers:v3.8.1 + +
calico/node:v3.8.1 + +
calico/pod2daemon-flexvol:v3.8.1 + +
k8s.gcr.io/kube-apiserver:v1.18.6 + k8s.gcr.io/kube-apiserver-arm64:v1.18.6 +
k8s.gcr.io/kube-controller-manager:v1.18.6 + k8s.gcr.io/kube-controller-manager-arm64:v1.18.6 +
k8s.gcr.io/kube-scheduler:v1.18.6 + k8s.gcr.io/kube-scheduler-arm64:v1.18.6 +
k8s.gcr.io/kube-proxy:v1.18.6 + k8s.gcr.io/kube-proxy-arm64:v1.18.6 +
k8s.gcr.io/coredns:1.6.7 --- coredns/coredns:1.6.7 +
k8s.gcr.io/etcd:3.4.3-0 + k8s.gcr.io/etcd-arm64:3.4.3-0 +
k8s.gcr.io/pause:3.2 + k8s.gcr.io/pause-arm64:3.2 +

Custom builds

Build multi arch image for Keycloak 9:

Clone repo: https://github.com/keycloak/keycloak-containers/

Checkout tag: 9.0.0

Change dir to: keycloak-containers/server

Create new builder: docker buildx create --name mybuilder

Switch to builder: docker buildx use mybuilder

Inspect builder and make sure it supports linux/amd64, linux/arm64: docker buildx inspect --bootstrap

Build and push container: docker buildx build --platform linux/amd64,linux/arm64 -t repo/keycloak:9.0.0 --push .


Additional info:

https://hub.docker.com/r/jboss/keycloak/dockerfile

https://github.com/keycloak/keycloak-containers/

https://catalog.redhat.com/software/containers/ubi8/ubi-minimal/5c359a62bed8bd75a2c3fba8?architecture=arm64&container-tabs=overview

https://docs.docker.com/docker-for-mac/multi-arch/

Components to roles mapping

  • Repository: repository, image-registry, node-exporter, firewall, filebeat, docker
  • Kubernetes: kubernetes-master, kubernetes-node, applications, node-exporter, haproxy_runc, kubernetes_common
  • Kafka: zookeeper, jmx-exporter, kafka, kafka-exporter, node-exporter
  • ELK (Logging): logging, elasticsearch, elasticsearch_curator, logstash, kibana, node-exporter
  • Exporters: node-exporter, kafka-exporter, jmx-exporter, haproxy-exporter, postgres-exporter
  • PostgreSQL: postgresql, postgres-exporter, node-exporter
  • Keycloak: applications
  • RabbitMQ: rabbitmq, node-exporter
  • HAProxy: haproxy, haproxy-exporter, node-exporter, haproxy_runc
  • Monitoring: prometheus, grafana, node-exporter

In addition to the mapping above, the following roles also need to be checked for each component:

  • upgrade
  • backup
  • download
  • firewall
  • filebeat
  • recovery (n/a kubernetes)

2 - Autoscaling

Design docs for Autoscaling

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack Autoscaling

Affected version: 0.7.x

1. Goals

We want to provide automatic scale up / down feature for cloud-based LambdaStack clusters (currently Azure and AWS).

  • Clusters will be resized in reaction to the resource utilisation (CPU and Memory).
  • Existing LambdaStack automation will be reused and optimized for the purpose of autoscaling.
  • Additional nodes will be added (removed) to (from) running Kubernetes clusters.
  • Horizontal Pod Autoscaler will be used to control number of pods for particular deployment.

2. Design proposal

PHASE 1: Adding ability to scale-down the pool of worker nodes.

  • The current LambdaStack codebase does not allow scaling down Kubernetes clusters in a proper way.
  • This is crucial for autoscaling to work, as we need to properly drain and delete physically-destroyed nodes from Kubernetes.
  • Also, this step needs to be performed before the Terraform code is executed (which requires a refactor of the lambdastack code).

PHASE 2: Moving terraform's state and lambdastack-cluster-config to a shared place in the cloud.

  • Currently LambdaStack keeps state files and cluster configs in the build/xxx/ directories, which means they cannot be shared easily.
  • To solve the issue, Terraform backends can be used: for Azure and for AWS.
  • For simplicity, the same "bucket" can be used to store and share the lambdastack-cluster-config.

PHASE 3: Building packer images to quickly add new Kubernetes nodes.

  • Autoscaling is expected to react reasonably quickly. Providing pre-built images should result in great speed-ups.
  • Packer code should be added to the lambdastack codebase somewhere "before" the terraform code executes.

PHASE 4: Realistic provisioning minimalization and speedup.

  • Currently LambdaStack's automation takes lots of time to provision clusters.
  • Limits and tags can be used to filter out unnecessary plays from the Ansible execution (for now, narrowing it just to the Kubernetes node provisioning), as illustrated below.
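A hedged illustration of the idea; the inventory path, playbook name and tag/group names below are assumptions, not actual LambdaStack specifics:

```bash
# Illustration only: narrow the Ansible run to Kubernetes node provisioning.
# Inventory path, playbook name and tag/group names are assumptions.
ansible-playbook -i build/mycluster/inventory site.yml \
  --limit "kubernetes_node" \
  --tags "kubernetes_node"
```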

PHASE 5: Adding ability to authenticate and run lambdastack from a pod.

  • To be able to execute lambdastack from a running LambdaStack cluster, it is required to deploy SSH keys and cloud access configuration (i.e. a Service Principal).
  • SSH keys can be created and distributed automatically (in Ansible) just for the purpose of autoscaling.
  • For now, it seems reasonable to store them in Kubernetes secrets (later Hashicorp Vault will be used).

PHASE 6: Introducing python application that will execute lambdastack from a pod (in reaction to performance metrics) to scale the pool of worker nodes.

  • Metrics can be obtained from the metrics server (see the example after this list).
  • For simplicity, standard CPU / Memory metrics will be used, but later it should be possible to introduce custom metrics taken from Prometheus.
  • Best way to package and deploy the application would be to use Helm (v3).
  • The docker image for the application can be stored in a public docker registry.
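For illustration, the standard metrics could be read like this (assuming metrics-server is deployed in the cluster):

```bash
# Read the standard CPU / Memory metrics the autoscaling application would act on
# (requires metrics-server to be deployed in the cluster).
kubectl top nodes
kubectl top pods --all-namespaces
```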

PHASE 7: Introducing standard Horizontal Pod Autoscaler to scale pods in LambdaStack clusters.

  • To scale Kubernetes pods in LambdaStack clusters the Horizontal Pod Autoscaler will be used.
  • This step will depend on the user / customer (the user will deploy and configure the proper resources inside Kubernetes); see the example below.
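A minimal example of such a standard Horizontal Pod Autoscaler, with a placeholder deployment name and thresholds:

```bash
# Create a standard Horizontal Pod Autoscaler for an example deployment
# ("myapp" and the thresholds are placeholders).
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=10
kubectl get hpa
```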

3 - AWS

Design docs for AWS

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack AWS support design document

Affected version: 0.3.0

Goals

Provide AWS support:

  1. Infrastructure setup automation
  2. AWS OS images support (RHEL, Ubuntu)
  3. Cluster security based on rules
  4. Virtual machines should be able to belong to different subnets within the LambdaStack cluster. Requirement is to have at least two subnets - one for Load Balancing (internet facing) and one for other components.
  5. Virtual machines should have data disk (when configured to have such)
  6. Components (Kafka, Postgresql, Prometheus, ElasticSearch) should be configured to use data disk space
  7. Cluster should not use any public IP except Load Balancer

Use cases

Support the AWS cloud so as not to rely on a single provider.

Proposed network design

LambdaStack on AWS network design

LambdaStack on AWS will create a resource group that will contain all cluster components. One of the resources will be an Amazon VPC (Virtual Private Cloud), an isolated section of the AWS cloud. Inside the VPC, subnets will be provisioned by the LambdaStack automation, based on data provided by the user or using defaults. Virtual machines and data disks will be created and placed inside a subnet.

4 - Backup

Design docs for Backup

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack backup design document

Affected version: 0.4.x

Goals

Provide backup functionality for LambdaStack - cluster created using lambdastack tool.

Backup will cover following areas:

  1. Kubernetes cluster backup

    1.1 etcd database

    1.2 kubeadm config

    1.3 certificates

    1.4 persistent volumes

    1.5 applications deployed on the cluster

  2. Kafka backup

    2.1 Kafka topic data

    2.2 Kafka index

    2.3 Zookeeper settings and data

  3. Elastic stack backup

    3.1 Elasticsearch data

    3.2 Kibana settings

  4. Monitoring backup

    4.1 Prometheus data

    4.2 Prometheus settings (properties, targets)

    4.3 Alertmanager settings

    4.4 Grafana settings (datasources, dashboards)

  5. PostgreSQL backup

    5.1 All databases from DB

  6. RabbitMQ settings and user data

  7. HAProxy settings backup

Use cases

A user/background service/job is able to back up the whole cluster or selected parts and store the files in a desired location. There are a few possible options for storing the backup:

  • S3
  • Azure file storage
  • local file
  • NFS

The application/tool will create a metadata file that defines the backup - information that can be useful for the restore tool. This metadata file will be stored within the backup file.

The backup is packed into a zip/gz/tar.gz file that has a timestamp in the name. If a name collision occurs, name+'_1' will be used.

Example use

lsbackup -b /path/to/build/dir -t /target/location/for/backup

Where -b is the path to the build folder that contains the Ansible inventory and -t is the target path to store the backup.

Backup Component View

LambdaStack backup component

User/background service/job executes lsbackup (code name) application. Application takes parameters:

  • -b: build directory of an existing cluster. Most important is the Ansible inventory in this directory, so it can be assumed that this is the folder containing the Ansible inventory file.
  • -t: target location of zip/tar.gz file that will contain backup files and metadata file.

When executed, the tool looks for the inventory file in the -b location and executes the backup playbooks. All playbooks are optional; in the MVP version it can try to back up all components (if they exist in the inventory). Later, some components can be skipped (by providing an additional flag or parameter to the CLI).

The tool also produces a metadata file that describes the backup with the time, the backed-up components and their versions.

1. Kubernetes cluster backup

There are a few ways of doing backups of an existing Kubernetes cluster. Two approaches are taken into further research.

First: Back up the etcd database and kubeadm config of a single master node. Instructions can be found here. A simple solution for that will back up etcd, which contains all workload definitions and settings. For example:
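A sketch of what that could look like on a master node, assuming the default kubeadm certificate paths and an illustrative /backup target directory:

```bash
# Sketch of the first approach, run on a master node; certificate paths are the kubeadm defaults.
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep the kubeadm configuration and certificates alongside the snapshot.
cp -r /etc/kubernetes/pki /backup/pki
kubeadm config view > /backup/kubeadm-config.yaml
```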

Second: Use 3rd party software to create a backup like Heptio Velero - Apache 2.0 license, Velero GitHub

2. Kafka backup

Possible options for backing up Kafka broker data and indexes:

  1. Mirror using Kafka Mirror Maker. It requires second Kafka cluster running independently that will replicate all data (including current offset and consumer groups). It is used mostly for multi-cloud replication.

  2. Kafka Connect – use Kafka Connect to get all topic and offset data from Kafka and save it to a filesystem (NFS, local, S3, ...) using a Sink connector.

    2.1 Confluent Kafka connector – uses the Confluent Community License Agreement
    2.2 Use another Open Source connector like kafka-connect-s3 (BSD) or kafka-backup (Apache 2.0)

  3. File system copy: take the Kafka broker and ZooKeeper data stored in files and copy them to the backup location. It requires the Kafka broker to be stopped. The solution is described in a Digital Ocean post; see the sketch after this list.
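A cold-copy sketch of option 3; the service names and data directories below are common defaults and may differ per installation:

```bash
# Cold-copy sketch for option 3: stop the services, copy the data, start them again.
# Service names and data directories are assumptions and may differ per installation.
systemctl stop kafka zookeeper
rsync -a /var/lib/kafka/ /backup/kafka/
rsync -a /var/lib/zookeeper/ /backup/zookeeper/
systemctl start zookeeper kafka
```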

3. Elastic stack backup

Use the built-in features of Elasticsearch to create a backup, like:

PUT /_snapshot/my_unverified_backup?verify=false
{
  "type": "fs",
  "settings": {
    "location": "my_unverified_backup_location"
  }
}

More information can be found here.
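With a repository registered as above, a snapshot can then be taken and listed through the same API, for example:

```bash
# Take a snapshot into the registered repository and list the existing snapshots.
curl -X PUT "localhost:9200/_snapshot/my_unverified_backup/snapshot_1?wait_for_completion=true"
curl -X GET "localhost:9200/_snapshot/my_unverified_backup/_all"
```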

OpenDistro uses a similar way of doing backups - it should be compatible. OpenDistro backups link.

4. Monitoring backup

Prometheus from version 2.1 is able to create a data snapshot via an HTTP request:

curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

Snapshot will be created in <data-dir>/snapshots/SNAPSHOT-NAME-RETURNED-IN-RESPONSE

More info

Files like targets and the Prometheus/Alertmanager settings should also be copied to the backup location.

5. PostgreSQL backup

Relational DB backup mechanisms are the most mature ones. The simplest solution is to use the standard PostgreSQL backup functions. Using pg_dump is also a valid option.

6. RabbitMQ settings and user data

RabbitMQ has a standard way of creating backups.

7. HAProxy settings backup

Copy HAProxy configuration files to backup location.

4.1 - Operational

Design docs for Backup Operational

LambdaStack backup design document with details

Affected version: 0.7.x

Goals

This document is an extension of the high-level design doc: LambdaStack backup design document, and describes a more detailed, operational point of view. The document does not include the Kubernetes and Kafka stacks.

Components

lsbackup application

Example use:

lambdastack backup -b build_dir -t target_path

Where -b is the path to the build folder that contains the Ansible inventory and -t is the target path to store the backup.

backup runs tasks from the Ansible backup role

build_dir contains the cluster's Ansible inventory

target_path is the location to store the backup, see the Storage section below.

Consider adding a disclaimer for the user to check whether the backup location has enough space to store the whole backup.

Storage

A location created on the master node to keep backup files. This location might be used to mount external storage, like:

  • Amazon S3
  • Azure blob
  • NFS
  • Any external disk mounted by administrator

In a cloud configuration, blob or S3 storage might be mounted directly on every machine in the cluster and can be configured by LambdaStack. For an on-prem installation it is up to the administrator to attach an external disk to the backup location on the master node. This location should be shared with other machines in the cluster as NFS.

Backup scripts structure:

Role backup

The main backup role contains Ansible tasks to run backups on the cluster components.

Tasks:

  1. Elasticsearch & Kibana

    1.1. Create a local location where the snapshot will be stored: /tmp/snapshots

    1.2. Update the elasticsearch.yml file with the backup location:

     ```bash
     path.repo: ["/tmp/backup/elastic"]
     ```
    

    1.3. Reload the configuration.

    1.4. Register the repository:

    curl -X PUT "https://host_ip:9200/_snapshot/my_backup?pretty" \
    -H 'Content-Type: application/json' -d '
    {
        "type": "fs",
        "settings": {
        "location": "/tmp/backup/elastic"
        }
    }
    '
    

    1.5. Take snapshot:

    curl -X PUT "https://host_ip:9200/_snapshot/my_backup/1?wait_for_completion=true" \
    -H 'Content-Type: application/json'
    

    This command will create a snapshot in the location configured in step 1.2

    1.6. Backup restoration:

    curl -X POST "https://host_ip:9200/_snapshot/my_backup/1/_restore" -H 'Content-Type: application/json'
    

    Consider options described in opendistro documentation

    1.7. Backup configuration files:

    /etc/elasticsearch/elasticsearch.yml
    /etc/kibana/kibana.yml
    
  2. Monitoring

    2.1.1 Prometheus data

    Prometheus delivers a solution to create data snapshots. Admin access is required to connect to the application API with admin privileges. By default admin access is disabled and needs to be enabled before snapshot creation. To enable admin access, --web.enable-admin-api needs to be set while starting the service:

    service configuration:
    /etc/systemd/system/prometheus.service
    
    systemctl daemon-reload
    systemctl restart prometheus
    

    Snapshot creation:

    curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    

    By default snapshot is saved in data directory, which is configured in Prometheus service configuration file as flag:

    --storage.tsdb.path=/var/lib/prometheus
    

    Which means that the snapshot directory is created under:

    /var/lib/prometheus/snapshots/yyyymmddThhmmssZ-*
    

    After the snapshot, admin access through the API should be reverted.

    The snapshot restoration process is just pointing the --storage.tsdb.path parameter to the snapshot location and restarting Prometheus.

    2.1.2. Prometheus configuration

    Prometheus configurations are located in:

    /etc/prometheus
    

    2.2. Grafana backup and restore

    Copy files from the Grafana home folder to the desired location and set up correct permissions:

    location: /var/lib/grafana
    content:
    - dashboards
    - grafana.db
    - plugins
    - png (contains rendered png images - not necessary to back up)
    

    2.3 Alert manager

    Configuration files are located in:

    /etc/prometheus
    

    The alertmanager.yml file should be copied in step 2.1.2 if it exists.

  3. PostgreSQL

    3.1. Basically PostgreSQL delivers two main tools for backup creation: pg_dump and pg_dumpall

    pg_dump creates a dump of a selected database:

    pg_dump dbname > dbname.bak
    

    pg_dumpall - creates a dump of all databases of a cluster into one script. This also dumps global objects that are common to all databases, like users, groups, tablespaces and properties such as access permissions (pg_dump does not save these objects)

    pg_dumpall > pg_backup.bak
    

    3.2. Database restore: psql or pg_restore:

    psql < pg_backup.bak
    pg_restore -d dbname db_name.bak
    

    3.3. Copy configuration files:

    /etc/postgresql/10/main/* - configuration files
    .pgpass - authentication credentials
    
    
  4. RabbitMQ

    4.1. RabbitMQ definitions might be exported using the API (the rabbitmq_management plugin needs to be enabled):

    rabbitmq-plugins enable rabbitmq_management
    curl -v -X GET http://localhost:15672/api/definitions -u guest:guest -H "content-type:application/json" -o json
    

    Import backed up definitions:

    curl -v -X POST http://localhost:15672/api/definitions -u guest:guest -H "content-type:application/json" --data backup.json
    

    or add backup location to configuration file and restart rabbitmq:

    management.load_definitions = /path/to/backup.json
    

    4.2. Backing up RabbitMQ messages

    To back up messages, RabbitMQ must be stopped. Copy the content of the RabbitMQ mnesia directory:

    RABBITMQ_MNESIA_BASE
    
    ubuntu:
    /var/lib/rabbitmq/mnesia
    

    Restoration: place these files in a similar location.

    4.3 Backing up configuration:

    Copy /etc/rabbitmq/rabbitmq.conf file

  5. HAProxy

Copy /etc/haproxy/ to backup location

Copy certificates stored in /etc/ssl/haproxy/ location.

4.2 - Cloud

Design docs for Cloud Backup

LambdaStack cloud backup design document

Affected version: 0.5.x

Goals

Provide backup functionality for LambdaStack - cluster created using lambdastack tool.

Use cases

Creating snapshots of disks for all elements in environment created on cloud.

Example use

lsbackup --disks-snapshot -f path_to_data_yaml

Where -f is the path to the data yaml file with the configuration of the environment, and --disks-snapshot indicates the option that will create a snapshot of whole disks.

Backup Component View

User/background service/job executes lsbackup (code name) application. Application takes parameters:

  • -f: path to data yaml file with configuration of environment.
  • --disks-snapshot: option to create whole disk snapshot

When executed, the tool takes the resource group from the file provided with the -f flag and creates snapshots of all elements in the resource group.

The tool also produces a metadata file that describes the backup with the time and the names of the disks for which snapshots have been created.

5 - Cache Storage

Design docs for Cache Storage

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

LambdaStack cache storage design document

Affected version: 0.4.x

Goals

Provide in-memory cache storage that will be capable of storing large amounts of data with high performance.

Use cases

LambdaStack should provide cache storage for key-value data and for the latest value taken from a queue (Kafka).

Architectural decision

Considered options are:

  • Apache Ignite
  • Redis

Comparison of both options:

  • License: Apache Ignite - Apache 2.0; Redis - three clause BSD license
  • Partition method: Apache Ignite - Sharding; Redis - Sharding
  • Replication: Apache Ignite - Yes; Redis - Control Plane-Node - yes, Control Plane - Control Plane - only enterprise version
  • Transaction concept: Apache Ignite - ACID; Redis - Optimistic lock
  • Data Grid: Apache Ignite - Yes; Redis - N/A
  • In-memory DB: Apache Ignite - Distributed key-value store, in-memory distributed SQL database; Redis - key-value store
  • Integration with RDBMS: Apache Ignite - Can integrate with any relational DB that supports JDBC driver (Oracle, PostgreSQL, Microsoft SQL Server, and MySQL); Redis - Possible using 3rd party software
  • Integration with Kafka: Apache Ignite - Using Streamer (Kafka Streamer, MQTT Streamer, ...) possible to insert to cache; Redis - Required 3rd party service
  • Machine learning: Apache Ignite - Apache Ignite Machine Learning - tools for building predictive ML models; Redis - N/A

Based on the above, Apache Ignite is not just a scalable in-memory cache/database but a caching and processing platform which can run transactional, analytical and streaming workloads. While Redis is simpler, Apache Ignite offers a lot more features with an Apache 2.0 licence.

Choice: Apache Ignite

Design proposal

[MVP] Add an Ansible role to lambdastack that installs Apache Ignite and sets up a cluster if there is more than one instance. The Ansible playbook is also responsible for adding more nodes to an existing cluster (scaling).

Possible problems while implementing Ignite clustering:

  • Ignite uses multicast for node discovery, which is not supported on AWS. The Ignite distribution comes with TcpDiscoveryS3IpFinder, so S3-based discovery can be used.

To consider:

  • Deploy Apache Ignite cluster in Kubernetes

6 - CI/CD

Design docs for CI/CD

Some of these date back to older versions, but efforts are made to keep the most important ones up to date - sometimes :)

Comparison of CI/CD tools

Research of available solutions

After some research I found the tools below, grouped by category in columns:

name paid open source self hosted cloud hosted
jenkins-x 0 1 1 0
tekton 0 1 1 0
jenkins 0 1 1 0
gitlabCI 0 1 1 0
goCD 0 1 1 0
bazel 0 1 1 0
argoCD 0 1 1 0
spinnaker 0 1 1 0
buildBot 0 1 1 0
Travis 0 0 0 1
buddy 1 0 1 1
circleCI 1 0 1 1
TeamCity 1 0 1 1
CodeShip 1 0 0 1
azureDevOps 1 0 0 1
Bamboo 1 0 1 0

Only open source and free (at least in our usage model) tools go forward for closer recognition.

Closer look at chosen tools

name paid open source self hosted cloud hosted comment
jenkins-x 0 1 1 0
tekton 0 1 1 0
jenkins 0 1 1 0
gitlabCI 0 1 1 0 requires using GitLab
goCD 0 1 1 0
argoCD 0 1 1 0 CD tool, requires another CI tool
bazel 0 1 1 0 this is a build engine, not a build server
spinnaker 0 1 1 0 mostly used for CD purposes
buildBot 0 1 1 0 looks worse than previous tools
Travis 0/1 0 0 1 In our usage model we will have to pay

After a closer look I consider these tools:

  • goCD
  • jenkins-x
  • tekton
  • jenkins
  • argoCD - this is a CD tool so it's not compared in the table below
  • spinnaker - wasn't tested because it is a CD tool and we also need a CI tool

Comparison

Run server on kubernetes

gocd: easily installed via a Helm chart; requires to be accessible from outside the cluster if we want to access the UI. Can also be run on plain Linux systems.

jenkins: can be easily started on any cluster

jenkins-x: hard to set up on a running cluster. I created a new kubernetes cluster with their tool, which generally is ok - but in my vision it would be good to use it on a LambdaStack cluster (eat your own dog food / drink your own champagne). Many (probably all) services work based on DNS names, so I also had to use a public domain (used my personal one).

tekton: easily started on LambdaStack cluster.

Access

gocd: OAuth, LDAP or internal database

jenkins: OIDC, LDAP, internal, etc.

jenkins-x: Jenkins X uses Role-Based Access Control (RBAC) policies to control access to its various resources

tekton: For building purposes there is a small service to which webhooks can connect and which starts a predefined pipeline. For browsing purposes the dashboard has no restrictions - it's open for everybody - this could be restricted by HAProxy or nginx. The only things you can do in the dashboard are re-running a pipeline or removing historical builds. Nothing more can be done.

Pipeline as a Code

gocd: possible and looks ok, pipeline code can be in different repository

jenkins: possible and looks ok

jenkins-x: possible looks ok (Tekton)

tekton: pipelines are CRDs, so they can only exist as code

Build in pods

gocd: Elastic agent concept. Many groups can be created (probably on different clusters - not tested yet) and assigned to the proper pipelines.

jenkins: plugin for building in kubernetes

jenkins-x: builds in pods in the cluster where jenkins-x is installed. Possible to install many jenkins-x servers (according to the documentation, one per team, each in a different namespace). Able to run in multi-cluster mode.

tekton: builds in the cluster easily. Not possible to build on a different server - but I didn't see any sense in that use case. Possible to deploy on another kubernetes service.

Secrets

gocd: Plugins for secrets from: hashicorp vault, kubernetes secrets, file based

jenkins: plugins for many options: hashicorp vault, kubernetes secrets, internal secrets, etc

jenkins-x: Providers for secrets from: hashicorp vault, kubernetes secrets

tekton: Uses secrets from kubernetes, so everything that is inside kubernetes can be read

Environment variables

gocd: multiple level of variables: environment, pipeline, stage, job

jenkins: environment variables can be overridden

jenkins-x: Didn't find any information, but I expect it will not be worse than in gocd

tekton: You can read env variables from any config map so this is kind of overriding.

Plugins

gocd: not a big number of plugins (but is this really bad?) but many of them are really useful (LDAP, running in pods, vault, k8s secrets, docker registry, push to S3, slack notification, etc.)

jenkins: many plugins. But if there are too many of them they start causing serious issues. Each plugin has a different quality level, each can break the server and has its own security issues, so we have to be very careful with them.

jenkins-x: plugins are called apps. There are few of them and these apps are helm charts. Jenkins-x uses embedded nexus, chartmuseum and monocular services. I don't know if there is an option to get rid of them.

tekton: tekton itself is a kind of plugin for building. You can create whatever you want in a different pod and get it.

Personal conclusion

gocd:

  • This looks like a really good central CI/CD server which can be used by many teams.
  • Really mature application. The oldest release on GitHub is from Nov 2014; according to the wiki, the first release was in 2007.
  • Very intuitive
  • Works really well in kubernetes
  • Good granularity of permissions
  • Good documentation
  • Small amount of help on the Internet (compared to jenkins)
  • Small community

GoCD can be easily set up for our organization. Adding new customers should not be a big deal. Working with it is very intuitive - an old-school concept of CI/CD.

jenkins:

  • Production ready
  • The most searched CI/CD tool on Google - so almost every case is described somewhere
  • Very simple
  • Works very well in kubernetes
  • After using it for some time, pipelines are getting bigger and harder to maintain
  • Good granularity of permissions
  • XML configuration for many plugins
  • Big amount of information on the Internet
  • Big community

The most popular CI/CD tool. Small and simple. You can do everything as code or via the GUI - which is not good, because there is the temptation to fix something right away and then probably not put it into the repository. A lot of plugins, each of which is a single point of failure. Hard to configure some plugins as code - but still possible.

jenkins-x:

  • There is a new sheriff in town - a new way of maintaining a CI/CD server
  • New application, still under heavy development (don't know what exactly, but the number of commits is really big)
  • A new concept of CI/CD, with a lot of magic going on under the hood, GitOps and ChatOps
  • Designed to work inside of kubernetes
  • Still don't know how to manage permissions
  • Big community (CDFoundation is under the Linux Foundation)

Jenkins-x is definitely the new sheriff in town. But enabling it in a big existing organization with a new way of doing the CI/CD process requires changing the way of thinking about the whole process. So it's a really hot topic, but is it ok for us to pay that price?

tekton:

  • New concept of CI - serverless.
  • Tekton is young (first release 20 Feb 2019).
  • Is a part of jenkins-x, so it's simpler when you start playing with it, and you can still configure everything as in jenkins-x by yourself.
  • Easy to install in a LambdaStack cluster - kubernetes CRDs.
  • Easy to install triggers which allow building when a request comes in.
  • There should be a separate namespace for every team. Builds will be running in one cluster using the same hosts.
  • No permissions for the dashboard. This has to be resolved by properly configuring HAProxy or nginx in front of the dashboard. The dashboard is running as a kubernetes service.
  • Big community.
  • Small but good enough amount of help regarding tekton itself. Under the hood it's kubernetes, so you can configure it as you want.

Compared to the previous solutions: jenkins-x is using tekton, so tekton has fewer features than jenkins-x - and thanks to that it is simpler - but by default I was not able to configure the really useful feature of building on push. There is such a possibility by running tekton triggers, which is really simple. This project is under the CDFoundation and has a big community, which is really good. My personal choice.

Another concept: separate CI and CD tools

Use separate tools for Continuous Integration and Continuous Deployment. In this concept I recognized Tekton for building and ArgoCD for delivery purposes.

ArgoCD

In ArgoCD you can easily deploy one of your applications, described as kubernetes resources, into one of your kubernetes clusters. In that case the recommended option is to have two repos: one for code and one for configuration. Thanks to that you can easily separate code from configuration. It also works with a single repo where you keep code and configuration together.

When Argo detects changes in the configuration, it applies the new configuration on the cluster. It's as simple as that.

User management

Possible to use: local users, SSO with Bundled Dex OIDC provider, SSO with Existing OIDC provider

Secrets

  • Bitnami Sealed Secrets
  • Godaddy Kubernetes External Secrets
  • Hashicorp Vault
  • Banzai Cloud Bank-Vaults
  • Helm Secrets
  • Kustomize secret generator plugins
  • aws-secret-operator
  • KSOPS

Conclusion

ArgoCD looks very good if you are managing a really big number of clusters. Thanks to that you can deploy whatever you want wherever you need. But this is really needed only at big scale.

7 - Command Line

Design docs for Command Line (CLI)

This directory contains design documents related to cli functionality itself.

7.1 - CLI

(Outdated) Needs updating - Design docs for CLI

LambdaStack CLI design document

Affected version: 0.2.1

Goals

Provide a simple to use CLI program that will:

  1. provide input validation (cmd arguments and data file)
  2. maintain LambdaStack cluster state (json file, binary, tbd)
  3. allow to create empty project (via command-line and data file)
  4. maintain information about LambdaStack version used on each machine (unique identifier generation?)
  5. allow to add/remove resources via data file.
    • separate infrastructure data files from configuration
    • internal file with default values will be created
  6. allow to add resources via command-line (networks, vpn, servers, roles, etc.)
  7. allow all messages from cli to be convertible to json/yaml (like -o yaml, -o json)
  8. pluggable storage/vault for LambdaStack state and Terraform state

Use cases

CLI deployments/management usage

Create empty cluster:

> LambdaStack create cluster --name='lambdastack-first-cluster'

Add resources to cluster:

> LambdaStack add machine --create --azure --size='Standard_DS2_v2' --name='master-vm-hostname'
> LambdaStack add master -vm 'master-vm-hostname'
> ...

Read information about cluster:

> LambdaStack get cluster-info --name='lambdastack-first-cluster'

CLI arguments should override default values which will be provided almost for every aspect of the cluster.

Data driven deployments/management usage - Configuration and Infrastructure definition separation

While CLI usage will be good for ad-hoc operations, production environments should be created using data files.

Data required for creating infrastructure (like network, vm, disk creation) should be separated from configuration (Kubernetes, Kafka, etc.).

Each data file should include following header:

kind: configuration/component-name # configuration/kubernetes, configuration/kafka, configuration/monitoring, ...
version: X.Y.Z
title: my-component-configuration
specification:
    # ...

Many configuration files will be handled using --- document separator. Like:

kind: configuration/kubernetes
# ...
---
kind: configuration/kafka
# ...

Creating infrastructure will be similar but it will use another file kinds. It should look like:

kind: infrastructure/server
version: X.Y.Z
title: my-server-infra-specification
specification:
    # ...

One format to rule them all

Same as many configurations can be enclosed in one file with --- separator, configuration and infrastructure yamls should also be treated in that way.

Example:

kind: configuration/kubernetes
# ...
---
kind: configuration/kafka
# ...
---
kind: infrastructure/server
#...

Proposed design - Big Picture

LambdaStack engine architecture proposal

Input

LambdaStack engine console application will be able to handle configuration files and/or commands.

Commands and data files will be merged with default values into a model that from then on will be used for configuration. If a data file (or command argument) contains some values, those values override the defaults.
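As a simple illustration of the merge behaviour (the document kind and field names below are hypothetical), a default value is kept unless the user's data file overrides it:

# internal defaults (hypothetical)
kind: configuration/kafka
specification:
  replicas: 3
  partitions: 1
---
# user data file - only sets what should differ from the defaults
kind: configuration/kafka
specification:
  partitions: 6

The merged model would then contain replicas: 3 (default) and partitions: 6 (user override).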

Infrastructure

The data file based on which the infrastructure will be created. Here the user can define VMs, networks, disks, etc. or just specify a few required values, and defaults will be used for the rest. Some of the values - like machine IPs (and probably some more) - will have to be determined at runtime.

Configuration

Data file for cluster components (e.g. Kubernetes/Kafka/Prometheus configuration). Some of the values will have to be retrieved from the Infrastructure config.

State

The state will be a result of platform creation (aka build). It should be stored in a configured location (storage, vault, directory). The state will contain all documents that took part in platform creation.

7.2 - CLI UX

Design docs for CLI UX

LambdaStack CLI UX

Affected version: unknown

Goals

The aim of this document is to improve the user experience with the lambdastack tool, with a strong emphasis on lowering the entry level for new users. It provides ideas for the following scenarios:

  • lambdastack installation
  • environment initialization and deployment
  • environment component update
  • cli tool update
  • add component to existing environment

Assumptions

Following scenarios assume:

  • there is a component version introduced - the lambdastack version is separated from component versions. It means that e.g. lambdastack v0.0.1 can provide component PostgreSQL 10.x and/or PostgreSQL 11.x.
  • there is a server-side component - a LambdaStack environment is always equipped with a server-side daemon component exposing some API to lambdastack.

Convention

I used square brackets with dots inside:

[...]

to indicate processing or output that is not important for this document.

Story

lambdastack installation

To increase the user base we need to provide a brew formula to allow simple installation.

> brew install lambdastack

environment initialization and deployment

init

As before, the user should be able to start interaction with lambdastack with the lambdastack init command. When run with no parameters, an interactive version would be opened.

> lambdastack init 
What cloud provider do you want to use? (Azure, AWS): AWS
Is that a production environment? No
Do you want Single Node Kubernetes?: No
How many Kubernetes Control Planes do you want?: 1
How many Kubernetes Nodes do you want?: 2
Do you want PostgreSQL relational database?: Yes
Do you want RabbitMQ message broker?: No
Name your new LambdaStack environment: test1
There is already environment called test1, please provide another name: test2
[...]
Your new environment configuration was generated! Go ahead and type: 'lambdastack status' or 'lambdastack apply'.

It could also be lambdastack init -p aws -t nonprod -c postgresql .... or lambdastack --no-interactive -p aws for non-interactive run.

inspect .lambdastack/

Previous command generated files in ~/.lambdastack directory.

> ls -la ~/.lambdastack
config
environments/
> ls -la ~/.lambdastack/environments/
test2/
> ls -la ~/.lambdastack/environments/test2/
test2.yaml
> cat ~/.lambdastack/config
version: v1
kind: Config
preferences: {}
environments:
- environment:
    name: test2
    localStatus: initialized
    remoteStatus: unknown
users:
- name: aws-admin
contexts:
- context:
    name: test2-aws-admin
    user: aws-admin
    environment: test2
current-context: test2-aws-admin

status after init

The output from lambdastack init asked to run lambdastack status.

> lambdastack status
Client Version: 0.5.3
Environment version: unknown
Environment: test2
User: aws-admin
Local status: initialized
Remote status: unknown
Cloud:
  Provider: AWS
  Region: eu-central-1
  Authorization: 
    Type: unknown
    State: unknown
Components: 
  Kubernetes: 
    Local status: initialized
    Remote status: unknown
    Nodes: ? (3)
    Version: 1.17.1
  PostgreSQL: 
    Local status: initialized
    Remote status: unknown
    Nodes: ? (1)
    Version: 11.2
---
You are not connected to your environment. Please type 'lambdastack init cloud' to provide authorization information!

As the output says, for now this command only uses local files in the ~/.lambdastack directory.

init cloud

Follow instructions to provide cloud provider authentication.

> lambdastack init cloud
Provide AWS API Key: HD876KDKJH9KJDHSK26KJDH 
Provide AWS API Secret: ***********************************
[...]
Credentials are correct! Type 'lambdastack status' to check environment. 

Or in non-interactive mode something like: lambdastack init cloud -k HD876KDKJH9KJDHSK26KJDH -s dhakjhsdaiu29du2h9uhd2992hd9hu.

status after init cloud

Follow instructions.

> lambdastack status 
Client Version: 0.5.3
Environment version: unknown 
Environment: test2 
User: aws-admin 
Local status: initialized 
Remote status: unknown 
Cloud: 
  Provider: AWS 
  Region: eu-central-1 
  Authorization:  
    Type: key-secret
    State: OK
Components:  
  Kubernetes:  
    Local status: initialized 
    Remote status: unknown 
    Nodes: ? (3) 
    Version: 1.17.1 
  PostgreSQL:  
    Local status: initialized 
    Remote status: unknown 
    Nodes: ? (1) 
    Version: 11.2  
--- 
Remote status is unknown! Please type 'lambdastack status update' to synchronize status with remote. 

status update

As lambdastack was able to connect to the cloud but does not know the remote state, it asked to update the status.

> lambdastack status update
[...]
Remote status updated!
> lambdastack status 
Client Version: 0.5.3
Environment version: unknown 
Environment: test2 
User: aws-admin 
Local status: initialized 
Remote status: uninitialized
Cloud: 
  Provider: AWS 
  Region: eu-central-1 
  Authorization:  
    Type: key-secret
    State: OK
Components:  
  Kubernetes:  
    Local status: initialized 
    Remote status: uninitialized
    Nodes: 0 (3) 
    Version: 1.17.1 
  PostgreSQL:  
    Local status: initialized 
    Remote status: uninitialized
    Nodes: 0 (1) 
    Version: 11.2 
--- 
Your cluster is uninitialized. Please type 'lambdastack apply' to start cluster setup. 
Please type 'lambdastack status update' to synchronize status with remote.

It connected to the cloud provider and checked that there is no cluster yet.

apply

> lambdastack apply
[...]
---
Environment 'test2' was initialized successfully! Please type 'lambdastack status' to see status or 'lambdastack components' to list components. To login to kubernetes cluster as root please type 'lambdastack components kubernetes login'.
Command 'lambdastack status' will synchronize every time now, so no need to run 'lambdastack status update'

lambdastack now knows that there is a cluster, and it will connect for status every time the user types lambdastack status, unless some additional preferences are used.

status after apply

Now it connects to the cluster to check the status. That relates to the assumption from the beginning of this document that there is some server-side component providing status. Otherwise lambdastack status would have to call multiple services for status.

> lambdastack status 
[...]
Client Version: 0.5.3
Environment version: 0.5.3
Environment: test2 
User: aws-admin 
Status: OK
Cloud: 
  Provider: AWS 
  Region: eu-central-1 
  Authorization:  
    Type: key-secret
    State: OK
Components:  
  Kubernetes:  
    Status: OK
    Nodes: 3 (3)
    Version: 1.17.1 
  PostgreSQL:  
    Status: OK
    Nodes: 1 (1) 
    Version: 11.2  
--- 
Your cluster is fully operational! Please type 'lambdastack components' to list components. To login to kubernetes cluster as root please type 'lambdastack components kubernetes login'.

kubernetes login

> lambdastack components kubernetes login
[...]
You can now operate your kubernetes cluster via 'kubectl' command! 

Content is added to the ~/.kube/config file. How exactly to do it is still to be agreed.

> kubectl get nodes
[...]

components

RabbitMQ is here on the list but with "-" because it is not installed.

> lambdastack components
[...]
+kubernetes
+postgresql
- rabbitmq

component status

> lambdastack components kubernetes status
[...]
Status: OK 
Nodes: 3 (3) 
Version: 1.17.1 (current)  
Running containers: 12
Dashboard: http://12.13.14.15:8008/ 

environment component update

3 months passed and a new version of a LambdaStack component was released. There is no need to update the client and there is no need to update all components at once. Every component is upgradable separately.

component status

The lambdastack status command will notify the user that there is a new component version available.

> lambdastack components kubernetes status
[...]
Status: OK 
Nodes: 3 (3) 
Version: 1.17.1 (outdated)  
Running containers: 73
Dashboard: http://12.13.14.15:8008/
---
Run 'lambdastack components kubernetes update' to update to 1.18.1 version! Use '--dry-run' flag to check update plan. 

component update

> lambdastack components kubernetes update
[...]
Kubernetes was successfully updated from version 1.17.1 to 1.18.1! 

It means that ONLY one component was updated. The user could probably write something like lambdastack components update or even lambdastack update, but there is no need to go all in if one does not want to.

cli tool update  

The user ran brew update and lambdastack was updated to the newest version.

status

> lambdastack status 
[...]
Client Version: 0.7.0
Environment version: 0.5.3
Environment: test2 
User: aws-admin 
Status: OK
Cloud: 
  Provider: AWS 
  Region: eu-central-1 
  Authorization:  
    Type: key-secret
    State: OK
Components:  
  Kubernetes:  
    Status: OK
    Nodes: 3 (3)
    Version: 1.18.1 
  PostgreSQL:  
    Status: OK
    Nodes: 1 (1) 
    Version: 11.2  
--- 
Your cluster is fully operational! Please type 'lambdastack components' to list components. To login to kubernetes cluster as root please type 'lambdastack components kubernetes login'.
Your client version is newer than the environment version. You might consider updating environment metadata to the newest version. Read more at https://lambdastack.github.io/environment-version-update.

It means that there is some metadata on the cluster with information that it was created and governed with lambdastack version 0.5.3, but the new version of the lambdastack binary can still communicate with the environment.

add component to existing environment

There is already existing environment and we want to add new component to it.

component init

> lambdastack components rabbitmq init
[...]
RabbitMQ config was added to your local configuration. Please type 'lambdastack apply' to apply changes.

Component configuration files were generated in .lambdastack directory. Changes are still not applied.

apply

> lambdastack apply
[...]
---
Environment 'test2' was updated! Please type 'lambdastack status' to see status or 'lambdastack components' to list components. To login to kubernetes cluster as root please type 'lambdastack components kubernetes login'.
Command 'lambdastack status' will synchronize every time now, so no need to run 'lambdastack status update'

Daemon

We should also consider scenario with web browser management tool. It might look like:

> lambdastack web
open http://127.0.0.1:8080 to play with environments configuration. Type Ctrl-C to finish ...
[...]

The user would be able to access the tool via a web browser based UI to operate it even more easily.

Context switching

The content of the ~/.lambdastack directory indicates that if the user types lambdastack init -n test3, additional content will be generated and the user will be able to do something like lambdastack context use test3 and lambdastack context use test2.


8 - Harbor Registry

Design docs for Harbor Registry

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

Docker Registry implementation design document

Goals

Provide a Docker container registry as a LambdaStack service: a registry for application container storage, Docker image signing and Docker image security scanning.

Use cases

Store application Docker images in a private registry. Sign Docker images with a passphrase so they can be trusted. Automated security scanning of Docker images which are pushed to the registry.

Architectural decision

Comparison of the available solutions

Considered options: Harbor, Quay.io and Portus (see the feature comparison below).

Feature comparison table

| Feature | Harbor | Quay.io | Portus |
|---------|--------|---------|--------|
| Ability to Determine Version of Binaries in Container | Yes | Yes | Yes |
| Audit Logs | Yes | Yes | Yes |
| Content Trust and Validation | Yes | Yes | Yes |
| Custom TLS Certificates | Yes | Yes | Yes |
| Helm Chart Repository Manager | Yes | Partial | Yes |
| Open source | Yes | Partial | Yes |
| Project Quotas (by image count & storage consumption) | Yes | No | No |
| Replication between instances | Yes | Yes | Yes |
| Replication between non-instances | Yes | Yes | No |
| Robot Accounts for Helm Charts | Yes | No | Yes |
| Robot Accounts for Images | Yes | Yes | Yes |
| Tag Retention Policy | Yes | Partial | No |
| Vulnerability Scanning & Monitoring | Yes | Yes | Yes |
| Vulnerability Scanning Plugin Framework | Yes | Yes | No |
| Vulnerability Whitelisting | Yes | No | No |
| Complexity of the installation process | Easy | Difficult | Difficult |
| Complexity of the upgrade process | Medium | Difficult | Difficult |

Source of comparison: https://goharbor.io/docs/1.10/build-customize-contribute/registry-landscape/ and also based on own experience (stack installation and upgrade).

Design proposal

Harbor services architecture


Implementation architecture

Additional components are required for Harbor implementation.

  • Shared storage volume between kubernetes nodes (in example NFS),
  • Component for TLS/SSL certificate request (maybe cert-manager?),
  • Component for TLS/SSL certificate store and manage certificate validation (maybe Vault?),
  • Component for TLS/SSL certificate share between server and client (maybe Vault?).
  • HELM component for deployment procedure.

Diagram for TLS certificate management:


Kubernetes deployment diagram:


Implementation steps

  • Deploy shared storage service (in example NFS) for K8s cluster (M/L)
  • Deploy Helm3 package manager and also Helm Charts for offline installation (S/M)
  • Deploy Hashicorp Vault for self-signed PKI for Harbor (external task + S for Harbor configuration)
  • Deploy "cert request/management" service and integrate with Hashicorp Vault - require research (M/L)
  • Deploy Harbor services using Helm3 with self-signed TLS certs (for non-production environments) (L) - a rough installation sketch follows this list
  • Deploy Harbor services using Helm3 with commercial TLS certs (for production environments) (M/L)
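As a rough sketch of the Helm3-based deployment step, Harbor could be installed from its public chart as shown below; the hostname and chart values are placeholders and should be verified against the Harbor chart and Helm version actually used:

helm repo add harbor https://helm.goharbor.io
helm install harbor harbor/harbor \
  --namespace harbor --create-namespace \
  --set externalURL=https://harbor.example.com \
  --set expose.type=ingress \
  --set expose.ingress.hosts.core=harbor.example.com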

9 - Health Monitor

Design docs for Health Monitor

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack Health Monitor service design proposal

Affected version: 0.6.x/0.7.x

Goals

Provide a service that will monitor components (Kubernetes, Docker, Kafka, EFK, Prometheus, etc.) deployed using LambdaStack.

Use cases

The service will be installed and used on Virtual Machines/Bare Metal on Ubuntu and RedHat (as a systemd service). Health Monitor will check the status of components that were installed on the cluster. Combinations of those components can be different and will be provided to the service through a configuration file.

Components that Health Monitor should check:

  • Kubernetes (kubelet)*
  • Query Kubernetes health endpoint (/healthz)*
  • Docker*
  • Query Docker stats*
  • PostgreSQL
  • HAProxy
  • Prometheus
  • Kafka
  • ZooKeeper
  • ElasticSearch
  • RabbitMQ

* means MVP version.

Health Monitor exposes an endpoint that is compliant with the Prometheus metrics format and serves data about health checks. This endpoint should listen on a configurable port (default 98XX).
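For illustration, the exposed data could look like the sketch below; the metric names are assumptions made for this proposal, not a final contract:

# HELP lambdastack_component_up Health of a monitored component (1 = healthy, 0 = unhealthy).
# TYPE lambdastack_component_up gauge
lambdastack_component_up{component="kubelet"} 1
lambdastack_component_up{component="docker"} 1
lambdastack_component_up{component="kafka"} 0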

Design proposal

TODO

10 - Infrastructure

Design docs for Infrastructure

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

Cloud resources naming convention

This document describes recommendations on how to name infrastructure resources that are usually created by Terraform. Unifying resource names allows to easily identify and search for any resource, even if no specific tags were provided.

Listed points are based on development of LambdaStack modules and best practices provided by Microsoft Azure.

In general resource name should match following schema:

<prefix>-<resource_type>-<index>

Prefix

LambdaStack modules are developed in a way that allows the user to specify a prefix for created resources. This approach gives benefits such as ordered sorting and identifying who the owner of the resource is. A prefix can include the following parts, with a dash - as a delimiter.

| Type | Required | Description | Examples |
|------|----------|-------------|----------|
| Owner | yes | The name of the person or team which the resource belongs to | LambdaStack |
| Application or service name | no | Name of the application, workload, or service that the resource is a part of | kafka, ignite, opendistro |
| Environment | no | The stage of the development lifecycle for the workload that the resource supports | prod, dev, qa |
| VM group | no | The name of the VM group that the resource is created for | group-0 |

Resource type

Resource type is a short name of resource that is going to be created. Examples:

  • rg: resource group
  • nsg: network security group
  • rt-private: route table for private networking

Index

Index is a serial number of the resource. If a single resource is created, 0 is used as the value.
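For example, following the schema above, resources created by the LambdaStack team for a production Kafka workload could be named as follows (hypothetical values):

lambdastack-kafka-prod-rg-0
lambdastack-kafka-prod-nsg-0
lambdastack-kafka-prod-group-0-rt-private-0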

11 - Kubernetes/Vault Integration

Design docs for Kubernetes and Vault integration

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack Kubernetes with Hashicorp Vault integration

Affected version: 0.7.x

1. Introduction

We want to provide integration of Kubernetes with Hashicorp Vault in a couple of different modes:

  1. vault - prod/dev mode without https
  2. vault - prod/dev mode with https
  3. vault - cluster with raft storage

We are not providing Vault in the Vault development mode as it doesn't provide data persistence.

If the user would like to, they can use automatic injection of secrets into Kubernetes pods with the sidecar integration provided by the Hashicorp Vault agent. Based on pod annotations, the sidecar will inject secrets as files into the annotated pods.
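For illustration, a pod template annotated for the Vault Agent injector could look roughly like this; the role name and secret path are placeholders, while the annotation keys come from the upstream Vault Agent injector:

spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "my-app"
        vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/my-app/db"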

2. Goal

In LambdaStack you can use Kubernetes secrets stored in etcd. We want to provide integration with Hashicorp Vault to provide additional security for secrets used inside applications running in LambdaStack, and also to provide the possibility to safely use secrets for components that are running outside of the Kubernetes cluster.

3. Design proposals

In all deployment models Vault is installed outside the Kubernetes cluster as a separate service. There is a possibility of using Hashicorp Vault deployed on a Kubernetes cluster, but this scenario is not covered in this document.

Integration between Kubernetes and Hashicorp Vault can be achieved via the Hashicorp Vault Agent that is deployed on the Kubernetes cluster using Helm. To provide this, Hashicorp Vault also needs to be configured with proper policies and with the kubernetes authentication method enabled.

Kubernetes Vault Integration

In every mode we want to provide the possibility to perform automatic unseal via a script, but this solution is better suited for development scenarios. In production, however, to maximize the security level, unseal should be performed manually.

In all scenarios, on the machine on which Hashicorp Vault will be running, swap will be disabled and Hashicorp Vault will run under a user with limited privileges (e.g. vault). The user under which Hashicorp Vault will be running will have the ability to use the mlock syscall. In the configuration from the LambdaStack side we want to provide the possibility to turn off dumps at the system level (turned off by default), use auditing (turned on by default), expose the UI (by default set to disabled) and disable the root token after configuration (by default the root token will be disabled after deployment).

We want to provide three scenarios of installing Hashicorp Vault:

  1. vault - prod/dev mode without https
  2. vault - prod/dev mode with https
  3. vault - cluster with raft storage

1. vault - prod/dev mode without https

In this scenario we want to use file storage for secrets. Vault can be set to manual or automatic unseal with a script. In automatic unseal mode the file with unseal keys is stored in a safe location with permission to read only by the vault user. In case of manual unseal, the Vault post-deployment configuration script needs to be executed against Vault. Vault is installed as a service managed by systemd. Traffic in this scenario is served via http, which makes man-in-the-middle attacks possible, so this option should only be used in development scenarios.

2. vault - prod/dev mode with https

This scenario differs from the previous one by the usage of https. In this scenario we should also cover generation of keys with usage of PKI, to provide certificates and mutual trust between the endpoints.

3. vault - cluster with raft storage

In this scenario we want to use raft storage for secrets. Raft storage is used for the cluster setup and doesn't require an additional Consul component, which makes configuration easier and requires less maintenance. It also limits network traffic and increases performance. In this scenario we can also implement auto-unseal provided with the Transit secrets engine from Hashicorp Vault.

In this scenario at least 3 nodes are required, but a 5-node setup is preferable to provide quorum for the raft protocol. This can cover http and also https traffic.
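A minimal sketch of a Vault server configuration file for this scenario might look as follows (paths, addresses and certificate locations are placeholders):

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-node-1"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

api_addr     = "https://10.0.0.10:8200"
cluster_addr = "https://10.0.0.10:8201"
ui           = false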

4. Further extensions

We can provide additional components for Vault unsealing - like integration with PGP keys to encrypt the unseal keys and auto-unsealing with the Transit secrets engine from Hashicorp Vault. We can also add integration with Prometheus to share statistics with it.

12 - Kafka Authentication

Design docs for Kafka Authentication

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack Kafka authentication design document

Affected version: 0.5.x

Goals

Provide authentication for Kafka clients and brokers using:

  1. SSL
  2. SASL-SCRAM

Use cases

  1. SSL - Kafka will authorize clients based on a certificate, where the certificate is signed by a common CA root certificate and validated against it.
  2. SASL-SCRAM - Kafka will authorize clients based on credentials, validated using SASL with SCRAM credentials stored in ZooKeeper.

Design proposal

Add to the LambdaStack configuration/kafka document a field that will select the authentication method - SSL or SASL with SCRAM. Based on this, the method of authentication will be selected together with its available settings (e.g. the number of iterations for SCRAM).

For the SSL option the CA certificate will be fetched to the machine where LambdaStack has been executed, so the user can sign his client certificates with the CA certificate and use them to connect to Kafka.

For the SASL with SCRAM option LambdaStack can also create additional SCRAM credentials that will be used for client authentication.
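A possible shape of such a configuration document is sketched below; the field names under security are assumptions made only for illustration, not a final schema:

kind: configuration/kafka
version: X.Y.Z
title: my-kafka-configuration
specification:
  security:
    authentication:
      enabled: true
      method: scram      # or: ssl
      scram:
        iterations: 8192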

13 - Kafka Monitoring Tools

Design docs for Kafka Monitoring Tools

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

KAFKA MONITORING TOOLS - RESEARCH RESULTS

  • Commercial feature, only trial version for free
  • Out of the box UI
  • Managing and monitoring Kafka cluster (including view consumer offset)
  • Possibility to set up alerts
  • Detailed documentation, lots of tutorials, blog articles and a wide community
  • All-in-one solution with additional features through Confluent Platform/Cloud
  • Commercial feature, only trial version for free
  • Out of the box UI
  • Deliver monitoring of Kafka data pipelines
  • Managing and monitoring Kafka cluster (including view consumer offset)
  • Possibility to set up alerts
  • Smaller community, fewer articles and tutorials around Lenses compared to the Control Center
  • Commercial feature, only trial version for free
  • ChatOps integrations
  • Out of the box UI
  • Built-in anomaly detection, threshold, and heartbeat alerts
  • Managing and monitoring Kafka cluster (including view consumer offset)
  • Possibility to set up alerts
  • Commercial feature, only trial version for free
  • Out of the box Kafka monitoring dashboards
  • Monitoring tool (including view consumer offset). Displays key metrics for Kafka brokers, producers, consumers and Apache Zookeeper. Less focused on cluster state
  • Possibility to set up alerts
  • Commercial feature, only trial version for free
  • Less rich monitoring tool compared to Confluent, Lenses and Datadog but is very convenient for companies that are already customers of Cloudera and need their monitoring mechanisms under the same platform
  • Commercial feature, only trial version for free
  • Out of the box UI
  • Monitoring tool (including view consumer offset)
  • Poor documentation
  • In latest changelogs, only support for kafka 2.1 mentioned
  • Some of the open source projects look much better than this one
  • Commercial feature, only trial version for free
  • Out of the box UI
  • Focused on filtering the messages within the topics and the creation of custom views
  • No possibility to set up alerts
  • Focuses more on business monitoring than on technical monitoring like Control Center or Lenses
  • KaDeck could be used in addition to the other monitoring tools
  • Opensource project, Apache-2.0 License
  • Managing and monitoring Kafka cluster (including view consumer offset)
  • Out of the box UI
  • No possibility to set up alerts
  • Opensource project, BSD 2-Clause "Simplified" License
  • Managing and monitoring Kafka cluster (not possible to view consumer offset :warning:)
  • Possible to track resource utilization for brokers, topics, and partitions, query cluster state, to view the status of partitions, to monitor server capacity (i.e. CPU, network IO, etc.)
  • Anomaly Detection and self-healing and rebalancing
  • No possibility to set up alerts
  • UI as a separate component link
  • It can use the metrics reporter from LinkedIn (necessary to add a jar file to the kafka lib directory) but it is also possible to use Prometheus for metric aggregation
  • Opensource project, Apache-2.0 License
  • Provides consumer lag checking as a service without the need for specifying thresholds. It monitors committed offsets for all consumers and calculates the status of those consumers on demand
  • It does not monitor anything related to the health of the brokers
  • Possibility to set up alerts
  • Opensource project, Apache-2.0 License, reboot of Kafdrop 2.x
  • Monitoring tool (including view consumer offset)
  • Out of the box UI
  • No possibility to set up alerts
  • Opensource project, Apache-2.0 License
  • Kafka monitor is a framework to implement and execute long-running Kafka system tests in a real cluster
  • It plays a role as a passive observer and reports what it observes (broker availability, produce/consume latency, etc) by emitting metrics. In other words, it pretends to be a Kafka user and keeps reporting metrics from the user's PoV
  • It is more a load generation and reporting tool
  • UI does not exist
  • No possibility to set up alerts

13. OTHERS

Tools like the ones on the list below exist as well, but they are usually smaller projects with little or no development activity:

14. CONCLUSIONS

Currently in LambdaStack, monitoring and getting metrics from Kafka are based on:

In real scenarios, based on some use cases and opinions from internal teams:

  • Kafka Exporter is used in order to get consumer offset and lag
  • JMX Exporter is used in order to get some standard broker's metrics such as cpu, memory utilization and so on

If it is possible to pay for a commercial license, Confluent, Lenses and Sematext offer richer functionality compared to the other monitoring tools, and they are very similar to each other.

As far as open source projects are considered:

  • LinkedIn Cruise Control looks like the winner. It provides not only managing and monitoring of the Kafka cluster but also some extra features such as rebalancing, anomaly detection or self-healing
  • Yahoo Cluster Manager looks like a good competitor, but only for managing and monitoring the Kafka cluster. However, compared to Cruise Control, some issues were encountered during installation, it was not able to receive some consumer data, and a few issues related to the problem are already reported in the official repository link. The project does not have a good spirit of open source software at all.
  • LinkedIn Burrow looks like a good additional tool for LinkedIn Cruise Control when it comes to a consumer lag checking service, and can be used instead of the Kafka Exporter plugin which causes some outstanding issues

14 - Kubernetes HA

Design docs for Kubernetes HA

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack Kubernetes HA design document

Affected version: 0.6.x

1. Goals

Provide highly-available control-plane version of Kubernetes.

2. Cluster components

2.1 Load balancer

2.1.1 External

Kubernetes HA cluster needs single TCP load-balancer to communicate from nodes to masters and from masters to masters (all internal communication has to go through the load-balancer).

Kubernetes HA - external LB

PROS:

  • standard solution

CONS:

  • it's not enough to just create one instance of such a load-balancer, it needs failover logic (like a virtual IP), so in the end for a fully highly-available setup we need automation for a whole new service
  • requires additional dedicated virtual machines (at least 2 for HA) even in the case of single-control-plane cluster
  • probably requires infrastructure that can handle virtual IP (depending on the solution for failover)

2.1.2 Internal

Following the idea from kubespray's HA-mode we can skip creation of dedicated external load-balancer (2.1.1).

Instead, we can create identical instances of lightweight load-balancer (like HAProxy) on each master and each kubelet node.

Kubernetes HA - internal LB

PROS:

  • no need for creation of dedicated load-balancer clusters with failover logic
  • since we could say that internal load-balancer is replicated, it seems to be highly-available by definition

CONS:

  • increased network traffic
  • longer provisioning times as (in case of any changes in load-balancer's configs) provisioning needs to touch every node in the cluster (master and kubelet node)
  • debugging load-balancer issues may become slightly harder

2.2 Etcd cluster

2.2.1 External

Kubernetes HA - external ETCD

PROS:

  • in the case of high network / system load external etcd cluster deployed on dedicated premium quality virtual machines will behave more stable

CONS:

  • requires automation for creation and distribution of etcd's server and client PKI certificates
  • upgrading etcd is difficult and requires well-tested automation that works on multiple nodes at once in perfect coordination - in the case when etcd's quorum fails, it is unable to auto-heal itself and it requires to be reconstructed from scratch (where data loss or discrepancy seems likely)

2.2.2 Internal

Kubernetes HA - internal ETCD

PROS:

  • adding / removing etcd nodes is completely automated and behaves as expected (via kubeadm)
  • etcd's PKI is automatically re-distributed during joining new masters to control-plane

CONS:

  • etcd is deployed in containers alongside other internal components, which may impact its stability when system / network load is high
  • since etcd is containerized it may be prone to docker-related issues

3. Legacy single-master solution

After HA logic is implemented, it is probably better to reuse new codebase also for single-master clusters.

In the case of using internal load-balancer (2.1.2) it makes sense to use scaled-down (to single node) HA cluster (with single-backended load-balancer) and drop legacy code.

4. Use cases

The LambdaStack delivers highly-available Kubernetes clusters deploying them across multiple availability zones / regions to increase stability of production environments.

5. Example use

kind: lambdastack-cluster
title: "LambdaStack Cluster Config"
provider: any
name: "k8s1"
build_path: # Dynamically built
specification:
  name: k8s1
  admin_user:
    name: ubuntu
    key_path: id_ed25519
    path: # Dynamically built
  components:
    kubernetes_master:
      count: 3
      machines:
        - default-k8s-master1
        - default-k8s-master2
        - default-k8s-master3
    kubernetes_node:
      count: 2
      machines:
        - default-k8s-node1
        - default-k8s-node2
    logging:
      count: 0
    monitoring:
      count: 0
    kafka:
      count: 0
    postgresql:
      count: 0
    load_balancer:
      count: 0
    rabbitmq:
      count: 0
---
kind: infrastructure/machine
provider: any
name: default-k8s-master1
specification:
  hostname: k1m1
  ip: 10.10.1.148
---
kind: infrastructure/machine
provider: any
name: default-k8s-master2
specification:
  hostname: k1m2
  ip: 10.10.2.129
---
kind: infrastructure/machine
provider: any
name: default-k8s-master3
specification:
  hostname: k1m3
  ip: 10.10.3.16
---
kind: infrastructure/machine
provider: any
name: default-k8s-node1
specification:
  hostname: k1c1
  ip: 10.10.1.208
---
kind: infrastructure/machine
provider: any
name: default-k8s-node2
specification:
  hostname: k1c2
  ip: 10.10.2.168

6. Design proposal

As for the design proposal, the simplest solution is to take the internal load-balancer (2.1.2) and internal etcd (2.2.2) and merge them together, then carefully observe and tune the network traffic coming from haproxy instances for a big number of worker nodes.

Kubernetes HA - internal LB

Example HAProxy config:

global
    log /dev/log local0
    log /dev/log local1 notice
    daemon

defaults
    log global
    retries 3
    maxconn 2000
    timeout connect 5s
    timeout client 120s
    timeout server 120s

frontend k8s
    mode tcp
    bind 0.0.0.0:3446
    default_backend k8s

backend k8s
    mode tcp
    balance roundrobin
    option tcp-check

    server k1m1 10.10.1.148:6443 check port 6443
    server k1m2 10.10.2.129:6443 check port 6443
    server k1m3 10.10.3.16:6443 check port 6443

Example ClusterConfiguration:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: v1.14.6
controlPlaneEndpoint: "localhost:3446"
apiServer:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
    audit-log-maxbackup: "10"
    audit-log-maxsize: "200"
    audit-log-path: "/var/log/apiserver/audit.log"
    enable-admission-plugins: "AlwaysPullImages,DenyEscalatingExec,NamespaceLifecycle,ServiceAccount,NodeRestriction"
    profiling: "False"
controllerManager:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
    profiling: "False"
    terminated-pod-gc-threshold: "200"
scheduler:
  extraArgs: # https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
    profiling: "False"
networking:
  dnsDomain: cluster.local
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
certificatesDir: /etc/kubernetes/pki

To deploy first master run (Kubernetes 1.14):

$ sudo kubeadm init --config /etc/kubernetes/kubeadm-config.yml --experimental-upload-certs

To add one more master run (Kubernetes 1.14):

$ sudo kubeadm join localhost:3446 \
         --token 932b4p.n6teb53a6pd1rinq \
         --discovery-token-ca-cert-hash sha256:bafb8972fe97c2ef84c6ac3efd86fdfd76207cab9439f2adbc4b53cd9b8860e6 \
         --experimental-control-plane --certificate-key f1d2de1e5316233c078198a610c117c65e4e45726150d63e68ff15915ea8574a

To remove one master run (it will properly cleanup config inside Kubernetes - do not use kubectl delete node):

$ sudo kubeadm reset --force

In later versions (Kubernetes 1.17) this feature became stable and the "experimental" word in the command-line parameters was removed.

7. Post-implementation erratum

15 - Leader Election Pod

Design docs for Leader Election Pod

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

Leader election in Kubernetes

Control plane components such as the controller manager or the scheduler use an Endpoints resource to select the leader. The instance which first creates the endpoint of this service adds an annotation to the endpoint with the leader information.

The leaderelection.go package is used for the leader election process. It leverages the above Kubernetes Endpoints resource as a kind of LOCK primitive to prevent any follower from creating the same endpoint in the same Namespace.

Leader election for pods

As far as leader election for pods is considered, there are a few possible solutions:

  1. Since Kubernetes introduced the coordination.k8s.io API group in version 1.14 (March 2019), it is possible to create a Lease object in the cluster which can hold the lock for the set of pods. It is necessary to implement simple code in the application using the leaderelection.go package in order to handle the leader election mechanism.

Helpful article:

This is the recommended solution: simple, based on the existing API group and Lease object, and not dependent on any external cloud object. A minimal Go sketch of this approach is shown after this list.

  2. Kubernetes already uses Endpoints to represent a replicated set of pods, so it is possible to use the same object for this purpose. It is possible to use the already existing leader election framework from Kubernetes which implements a simple mechanism. It is necessary to run a leader-election container as a sidecar container for the replica set of application pods. Using the leader-election sidecar container, an endpoint will be created which will be responsible for locking the leader to one pod. Thanks to that, when creating a deployment with 3 pods, only one container with the application will be in the ready state - the one that runs inside the leader pod. For the application container, it is necessary to add a readiness probe pointing to the sidecar container:

Helpful article:

This solution was recommended by Kubernetes in 2016 and looks a little bit outdated; it is complex and requires some work.

  3. Microsoft and Google came up with a proposal to use cloud native storage with a single object that contains the leader data, but it requires each node to read that file, which can be problematic in some situations.

Helpful articles:

It is not a recommended solution since the single object is a potential single point of failure.
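A minimal Go sketch of option 1, using the client-go leader election helpers with a Lease lock (the lease name, namespace and timings below are placeholders):

package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each pod uses its own identity, here simply the pod hostname.
	id, _ := os.Hostname()

	// The Lease object acts as the shared lock for all replicas.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-app-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Printf("%s became the leader, starting work", id)
				<-ctx.Done() // keep working until leadership is lost
			},
			OnStoppedLeading: func() {
				log.Printf("%s lost leadership", id)
			},
		},
	})
}

Each replica runs the same code; only the pod that currently holds the lease executes OnStartedLeading, while the others keep retrying until the lease is released or expires.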

16 - Modularization

Design docs for Modularization

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

This directory contains design documents related to modularization of LambdaStack.

16.1 -

Basic Infra Modules VS LambdaStack Infra

Basic overview

This represents the current status on: 05-25-2021

:heavy_check_mark: : Available
:x: : Not available
:heavy_exclamation_mark: : Check the notes

| Feature | LambdaStack Azure | LambdaStack AWS | Azure BI | AWS BI |
|---------|-------------------|-----------------|----------|--------|
| Network: Virtual network | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Private subnets | :heavy_exclamation_mark: | :heavy_exclamation_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Public subnets | :heavy_exclamation_mark: | :heavy_exclamation_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Security groups with rules | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: |
| Possibility for Bastion host | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: |
| Possibility to connect to other infra (EKS, AKS) | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: |
| VM "Groups" with similar configuration | :heavy_check_mark: | :heavy_exclamation_mark: | :heavy_check_mark: | :heavy_check_mark: |
| Data disks | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: |
| Shared storage (Azure Files, EFS) | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |
| Easy configuration | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: |

Notes

  • On LambdaStack AWS/Azure infrastructure we can either have a cluster with private or public subnets, as public IPs can only be applied cluster wide and not on a VM "group" basis.
  • On LambdaStack AWS we use Auto Scaling Groups to represent groups of similar VMs. This approach however has lots of issues when it comes to scaling the group/component.

Missing for Modules

  1. Currently, the Azure BI module does not have a way to implement security groups per subnet with rules configuration. An issue already exists for that here.
  2. Both BI modules currently only give a default configuration, which makes it hard to create a full component layout for a full cluster.

16.2 -

Context

This design document presents findings on what the important pieces of module communication are in the Dockerized Custom Modules approach described here.

Plan

The idea is to have something running and working that mimics real world modules. I used GNU make to perform this. With GNU make I was able to easily implement the "run" logic. I also wanted to package everything into docker images to experience real world container limitations around communication, work directory sharing and other things.

Dependencies problem

First list of modules is presented here:

version: v1
kind: Repository
components:
- name: c1
  type: docker
  versions:
  - version: 0.1.0
    latest: true
    image: "docker.io/hashicorp/terraform:0.12.28"
    workdir: "/terraform"
    mounts: 
    - "/terraform"
    commands:
    - name: init
      description: "initializes terraform in local directory"
      command: init
      envs:
        TF_LOG: WARN
    - name: apply
      description: "applies terraform in local directory"
      command: apply
      envs:
        TF_LOG: DEBUG
      args:
      - -auto-approve

... didn't have any dependencies section. We know that some kind of dependencies will be required very soon. I created an idea of how to define dependencies between modules in the following mind map:


It shows following things:

  • every module has some set of labels. I don't think we need to have any "obligatory" labels. If you create very custom ones, your module will be very hard to find.
  • a module has a requires section with possible subsections strong and weak. A strong requirement is one that has to be fulfilled for the module to be applied. A weak requirement, on the other hand, is something we can proceed without, but it is in some way connected when present.

It's worth noticing each requires rule. I used the kubernetes matchExpressions approach as the main way of defining dependencies, as one of the main usages here would be "version >= X", and we cannot use a simple label matching mechanism without being forced to update all modules using my module every time I release a new version of that module.

Influences

I started to implement example docker based mocked modules in the tests directory, and I found a 3rd required section: influences. To explain this, let's look at one folded module in the picture above: "BareMetalMonitoring". It is a Prometheus based module so, as it works in pull mode, it needs to know the addresses of the machines it should monitor. Let's imagine the following scenario:

  • I have Prometheus already installed, and it knows about IP1, IP2 and IP3 machines to be monitored,
  • in next step I install, let's say BareMetalKafka module,
  • so now, I want Prometheus to monitor Kafka machines as well,
  • so, I need the BareMetalKafka module to "inform" in some way the BareMetalMonitoring module to monitor the IP4, IP5 and IP6 addresses in addition to what it monitors already.

This example explains the influences section. A mocked example follows:

labels:
  version: 0.0.1
  name: Bare Metal Kafka
  short: BMK
  kind: stream-processor
  core-technology: apache-kafka
  provides-kafka: 2.5.1
  provides-zookeeper: 3.5.8
requires:
  strong:
    - - key: kind
        operator: eq
        values: [infrastructure]
      - key: provider,
        operator: in,
        values:
          - azure
          - aws
  weak:
    - - key: kind
        operator: eq
        values:
          - logs-storage
    - - key: kind
        operator: eq
        values:
          - monitoring
      - key: core-technology
        operator: eq
        values:
          - prometheus
influences:
  - - key: kind
      operator: eq
      values:
        - monitoring

As presented, there is an influences section notifying that "there is something that I'll do to the selected module (if it's present)". I do not feel the urge to define it more strictly at this point in time, before development. I know that this kind of influences section will be required, but I do not know exactly how it will end up.

Results

During implementation of mocks I found that:

  • influences section would be required
  • name of method validate-config (or later just validate) should in fact be plan
  • there is no need to implement a get-state method in the module container provider as the state will be local and shared between modules. In fact some state related operations would probably be implemented on the cli wrapper level.
  • instead, there is a need for an audit method which would be extremely important to check if no manual changes were applied to the remote infrastructure

Required methods

As already described there would be 5 main methods required to be implemented by module provider. Those are described in next sections.

Metadata

That is a simple method to display static YAML/JSON (or any kind of structured data) information about the module. In fact the information from this method should be exactly the same as what is in the repo file section about this module. Example output of the metadata method might be:

labels:
  version: 0.0.1
  name: Bare Metal Kafka
  short: BMK
  kind: stream-processor
  core-technology: apache-kafka
  provides-kafka: 2.5.1
  provides-zookeeper: 3.5.8
requires:
  strong:
    - - key: kind
        operator: eq
        values: [infrastructure]
      - key: provider,
        operator: in,
        values:
          - azure
          - aws
  weak:
    - - key: kind
        operator: eq
        values:
          - logs-storage
    - - key: kind
        operator: eq
        values:
          - monitoring
      - key: core-technology
        operator: eq
        values:
          - prometheus
influences:
  - - key: kind
      operator: eq
      values:
        - monitoring

Init

The init method's main purpose is to jump start usage of a module by generating (in a smart way) a configuration file using information from the state. In the example Makefile which is stored here you can test the following scenario:

  • make clean
  • make init-and-apply-azure-infrastructure
  • observe what is in ./shared/state.yml file:
    azi:
      status: applied
      size: 5
      provide-pubips: true
      nodes:
        - privateIP: 10.0.0.0
          publicIP: 213.1.1.0
          usedBy: unused
        - privateIP: 10.0.0.1
          publicIP: 213.1.1.1
          usedBy: unused
        - privateIP: 10.0.0.2
          publicIP: 213.1.1.2
          usedBy: unused
        - privateIP: 10.0.0.3
          publicIP: 213.1.1.3
          usedBy: unused
        - privateIP: 10.0.0.4
          publicIP: 213.1.1.4
          usedBy: unused
    
    it mocked that it created some infrastructure with VMs having some fake IPs.
  • change IP manually a bit to observe what I mean by "smart way"
    azi:
      status: applied
      size: 5
      provide-pubips: true
      nodes:
        - privateIP: 10.0.0.0
          publicIP: 213.1.1.0
          usedBy: unused
        - privateIP: 10.0.0.100 <---- here
          publicIP: 213.1.1.100 <---- and here
          usedBy: unused
        - privateIP: 10.0.0.2
          publicIP: 213.1.1.2
          usedBy: unused
        - privateIP: 10.0.0.3
          publicIP: 213.1.1.3
          usedBy: unused
        - privateIP: 10.0.0.4
          publicIP: 213.1.1.4
          usedBy: unused
    
  • make just-init-kafka
  • observe what was generated in ./shared/bmk-config.yml
    bmk:
      size: 3
      clusterNodes:
        - privateIP: 10.0.0.0
          publicIP: 213.1.1.0
        - privateIP: 10.0.0.100
          publicIP: 213.1.1.100
        - privateIP: 10.0.0.2
          publicIP: 213.1.1.2
    
    it used what it found in state file and generated config to actually work with given state.
  • make and-then-apply-kafka
  • check it got applied to state file:
    azi:
      status: applied
      size: 5
      provide-pubips: true
      nodes:
        - privateIP: 10.0.0.0
          publicIP: 213.1.1.0
          usedBy: bmk
        - privateIP: 10.0.0.100
          publicIP: 213.1.1.100
          usedBy: bmk
        - privateIP: 10.0.0.2
          publicIP: 213.1.1.2
          usedBy: bmk
        - privateIP: 10.0.0.3
          publicIP: 213.1.1.3
          usedBy: unused
        - privateIP: 10.0.0.4
          publicIP: 213.1.1.4
          usedBy: unused
    bmk:
      status: applied
      size: 3
      clusterNodes:
        - privateIP: 10.0.0.0
          publicIP: 213.1.1.0
          state: created
        - privateIP: 10.0.0.100
          publicIP: 213.1.1.100
          state: created
        - privateIP: 10.0.0.2
          publicIP: 213.1.1.2
          state: created
    

So the init method is not just about providing a "default" config file, but about actually providing a "meaningful" configuration file. What is significant here is that it's very easily testable whether that method generates the desired configuration when given different example state files.

Plan

plan method is a method to:

  • validate that config file has correct structure,
  • get the state file, extract the module related piece and compare it to the config to "calculate" if there are any changes required and, if yes, what they are.

This method should always be started before apply by the cli wrapper.

The general reason for this method is that after we "smart initialized" the config, we might have wanted to change some values in some way, and then it has to be validated. Another scenario would be the influence mechanism I described in the Influences section. In that scenario it's easy to imagine that the output of the BMK module would produce proposed changes to the BareMetalMonitoring module, or even apply them to its config file. It looks obvious that an automatic "apply" operation on the BareMetalMonitoring module is not a desired option. So we want to suggest to the user "hey, I applied the Kafka module, and usually it influences the configuration of the Monitoring module, so go ahead and do a plan operation on it to check the changes". Or we can even do an automatic "plan" operation and show what those changes are.

Apply

apply is the main "logic" method. Its purpose is to do 2 things:

  • apply module logic (i.e.: install software, modify a config, manage service, install infrastructure, etc.),
  • update state file.

In fact, you might debate which of those is more important, and I could argue that updating state file is more important.

To perform its operations it uses config file previously validated in plan step.

Audit

The audit method's use case is to check how the existing components are "understood" by the component provider logic. A standard situation would be an upgrade procedure. We can imagine the following history:

  • I installed BareMetalKafka module in version 0.0.1
  • Then I manually customized configuration on cluster machines
  • Now I want to update BareMetalKafka to version 0.0.2 because it provides something I need

In such a scenario, checking if the upgrade operation will succeed is critical, and that is the duty of the audit operation. It should check on the cluster machines if the "known" configuration is still "known" (whatever that means for now) and that the upgrade operation will not destroy anything.

Another use case for the audit method is to reflect manually introduced changes in the configuration (and / or state). If I manually upgraded a minor version of some component (e.g. 1.2.3 to 1.2.4), it's highly possible that it might be easily reflected in the state file without any trouble to other configuration.

Optional methods

There are also already known methods which would be required by most (or maybe all) modules, but are not core to module communication. Those are purely "internal" module business. The following examples are probably just a subset of optional methods.

Backup / Restore

Provide backup and restore functionalities to protect data and configuration of installed module.

Update

Perform steps to update module components to newer versions with data migration, software re-configuration, infrastructure remodeling and any other required steps.

Scale

Operations related to scale up and scale down module components.

Check required methods implementation

All accessible methods would be listed in module metadata as proposed here. That means that it's possible to:

  • validate if there are all required methods implemented,
  • validate if required methods return in expected way,
  • check if state file is updated with values expected by other known modules.

All that means that we would be able to automate the module release process, test it separately and validate its compliance with module requirements.

Future work

We should consider during the development phase if and how to present in the manifest what external fields a module requires for the apply operation. That way we might be able to catch inconsistencies between what one module provides and what another module requires from it.

Another topic to consider is some standardization over modules labeling.

16.3 -

Ansible based module

Purpose

To provide separation of concerns in middleware level code we need to have a consistent way to produce Ansible based modules.

Requirements

There are following requirements for modules:

  • Allow two-way communication with other modules via the Statefile
  • Allow reuse of Ansible roles between modules

Design

Components

  1. Docker – infrastructure modules are created as Docker containers so far so this approach should continue
  2. Ansible – we do have tons of ansible code which could be potentially reused. Ansible is also a de facto industry standard for software provisioning, configuration management, and application deployment.
  3. Ansible-runner – due to need of automation we should use ansible-runner application which is a wrapper for ansible commands (i.e.: ansible-playbook) and provides good code level integration features (i.e.: passing of variables to playbook, extracting logs, RC and facts cache). It is originally used in AWX.
  4. E-structures – we started to use e-structures library to simplify interoperability between modules.
  5. Ansible Roles – we need to introduce more loosely coupled ansible code while extracting it from the main LambdaStack code repository. To achieve it we need to utilize ansible roles in the “ansible galaxy” way, which means each role should be separately developed, tested and versioned. To coordinate multiple roles, they should be connected in a single module playbook.

Commands

Current state of understanding of modules is that we should have at least two commands:

  1. Init – would be responsible for building the configuration file for the module. In design, it would be exactly the same as the “init” command in infrastructure modules.
  2. Apply – that command would start the ansible logic in the following order:
    1. Template the inventory file – the command would take the configuration file and, using its values, generate an ansible inventory file with all variables required by the playbook (a hypothetical result is sketched after this list).
    2. Provide the ssh key file – the command would copy the key provided in the “shared” directory into the expected location in the container
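A hypothetical result of the inventory templating step could look like this (group name, hosts and variables are illustrative only):

[kafka]
10.0.0.10 ansible_user=operations ansible_ssh_private_key_file=/shared/vms_rsa
10.0.0.11 ansible_user=operations ansible_ssh_private_key_file=/shared/vms_rsa

[kafka:vars]
kafka_version=2.8.1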

There is also a possibility to introduce an additional “plan” command with usage of the “--diff” and “--check” flags for the ansible playbook, but:

  • It doesn't look like a required step, unlike in terraform-based modules
  • It requires additional investigation for each role on how to implement it.

Structure

Module repository should have structure similar to following:

  • Directory “cmd” – Golang entrypoint binary files should be located here.
  • Directory “resources” – would be the root of the ansible-runner “main” directory
    • Subdirectory “project” – this directory should contain the entrypoint.yml file, the main module playbook.
      • Subdirectory “roles” – this optional directory should contain local (not shared) roles. Having this directory would be considered a “bad habit”, but it's possible.
  • Files in the “root” directory – Makefile, Dockerfile, go.mod, README.md, etc.

16.4 -

LambdaStack modular design document

Affected version: 0.4.x

Goals

Make lambdastack easier to work on with multiple teams and make it easier to maintain/extend by:

  1. Splitting up the monolithic LambdaStack into separate modules which can run as standalone CLI tools or be linked together through LambdaStack.
  2. Creating an extendable plug-and-play system for roles which can be assigned to components based on certain tasks: apply, upgrade, backup, restore, test etc.

Architectural design

The current monolithic lambdastack will be split up into the following modules.

Module cli design proposal

Core

Shared code between other modules and not executable as standalone. Responsible for:

  • Config
  • Logging
  • Helpers
  • Data schema handling: Loading, defaults, validating etc.
  • Build output handling: Loading, saving, updating etc.
  • Ansible runner

Infrastructure

Module for creating/destroying cloud infrastructure on AWS/Azure/Google... + "Analysing" existing infrastructure. Maybe at a later time we want to split up the different cloud providers into plugins as well.

Functionality (rough outline and subject to change):

  1. template:
    "lambdastack infra template -f outfile.yaml -p awz/azure/google/any (--all)"
    "infra template -f outfile.yaml -p awz/azure/google/any (--all)"?
    "Infrastructure.template(...)"
    Task: Generate a template yaml with lambdastack-cluster definition + possible infra docs when --all is defined
    Input:  File to output data, provider and possible all flag
    Output: outfile.yaml template
    
  2. apply:
    "lambdastack infra apply -f data.yaml"
    "infra apply -f data.yaml"?
    "Infrastructure.apply(...)"
    Task: Create/Update infrastructure on AWS/Azure/Google...
    Input:  Yaml with at least lambdastack-cluster + possible infra docs
    Output: manifest, ansible inventory and terraform files
    
  3. analyse:
    "lambdastack infra analyse -f data.yaml"
    "infra analyse -f data.yaml"?
    "Infrastructure.analyse(...)"
    Task: Analyse existing infrastructure
    Input:  Yaml with at least lambdastack-cluster + possible infra docs
    Output: manifest, ansible inventory
    
  4. destroy:
    "lambdastack infra destroy -b /buildfolder/"
    "infra destroy -b /buildfolder/"?
    "Infrastructure.destroy(...)"
    Task:  Destroy all infrastructure on AWS/Azure/Google?
    Input:  Build folder with manifest and terraform files
    Output: Deletes the build folder.
    

Repository

Module for creating and tearing down a repo + preparing requirements for offline installation.

Functionality (rough outline and subject to change):

  1. template:
    "lambdastack repo template -f outfile.yaml  (--all)"
    "repo template -f outfile.yaml (--all)"?
    "Repository.template(...)"
    Task: Generate a template yaml for a repository
    Input:  File to output data, provider and possible all flag
    Output: outfile.yaml template
    
  2. prepare:
    "lambdastack repo prepare -os (ubuntu-1904/redhat-7/centos-7)"
    "repo prepare -o /outputdirectory/"?
    "Repo.prepare(...)"
    Task: Create the scripts for downloading requirements for a repo for offline installation for a certain OS.
    Input:  OS which we want to output the scripts for: (ubuntu-1904/redhat-7/centos-7)
    Output: The download scripts for the given OS
    
  3. create:
    "lambdastack repo create -b /buildfolder/ (--offline /foldertodownloadedrequirements)"
    "repo create -b /buildfolder/"?
    "Repo.create(...)"
    Task: Create the repository on a machine (either by running the requirements script or by copying already prepared requirements) and set up the other VMs/machines to point to said repo machine (online or offline depending on the --offline flag).
    Input:  Build folder with manifest and ansible inventory and a possible offline requirements folder for on-prem installation.
    Output: repository manifest or something only with the location of the repo?
    
  4. teardown:
    "lambdastack repo teardown -b /buildfolder/"
    "repo teardown -b /buildfolder/"?
    "Repo.teardown(...)"
    Task: Disable the repository and reset the other VMs/machines to their previous state.
    Input:  Build folder with manifest and ansible inventory
    Output: -
    

Components

Module for applying a command on a component which can contain one or multiple roles. It will take the Ansible inventory to determine which roles should be applied to which component. The commands each role can implement are (rough outline and subject to change):

  • apply: Command to install roles for components
  • backup: Command to backup roles for components
  • restore: Command to restore roles for components
  • upgrade: Command to upgrade roles for components
  • test: Command to test roles for components

The apply command should be implemented for every role but the rest is optional. From an implementation perspective each role will just be a separate folder inside the plugins directory of the components module, with command folders containing the ansible tasks:

components-|
           |-plugins-|
                     |-master-|
                     |        |-apply
                     |        |-backup
                     |        |-restore
                     |        |-upgrade
                     |        |-test
                     |
                     |-node-|
                     |      |-apply
                     |      |-backup
                     |      |-restore
                     |      |-upgrade
                     |      |-test
                     |
                     |-kafka-|
                     |       |-apply
                     |       |-upgrade
                     |       |-test

Based on the Ansible inventory and the command we can easily select which roles to apply to which components. For the commands we probably also want to introduce some extra flags to only execute commands for certain components.
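
For illustration, with an inventory like the one below (host names are placeholders), running for example the upgrade command for the kafka component would execute only the tasks from plugins/kafka/upgrade against the hosts in the [kafka] group:

[master]
master-vm-1

[node]
node-vm-1
node-vm-2

[kafka]
kafka-vm-1
kafka-vm-2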

Finally, we want to add support for an external plugin directory where teams can specify their own role plugins which are not (yet) available inside LambdaStack itself. This feature can also be used by other teams to more easily start contributing and developing new components.

LambdaStack

Bundles all executable modules (Infrastructure, Repository, Component) and adds functions to chain them together:

Functionality (rough outline and subject to change):

  1. template:
    "lambdastack template -f outfile.yaml -p awz/azure/google/any (--all)"
    "LambdaStack.template(...)"
    Task: Generate a template yaml with lambdastack-cluster definition + possible infrastructure, repo and component configurations
    Input:  File to output data, provider and possible all flag
    Output: outfile.yaml with templates
    
  2. apply:
    "lambdastack apply -f input.yaml"
    "LambdaStack.template(...)"
    Task: Sets up a cluster from start to finish
    Input:  File to output data, provider and possible all flag
    Output: Build folder with manifest, ansible inventory, terrafrom files, component setup.
    

...

16.5 -

Intent

This document tries to compare 3 existing propositions to implement modularization.

Compared models

To introduce modularization in LambdaStack we identified 3 approaches to consider. The following sections briefly describe those 3 approaches.

Dockerized custom modules

This approach would look the following way:

  • Each component management code would be packaged into docker containers
  • Components would need to provide some known call methods to expose metadata (dependencies, info, state, etc.)
  • Each component would be managed by one management container
  • Components (and thus management containers) can depend on each other in a ‘pre-requisite’ manner (not a runtime dependency, but the order of execution)
  • A separate wrapper application would call component execution and process metadata (dependencies, info, state, etc.)

All that means that if we would like to install the following stack:

  • On-prem Kubernetes cluster
  • Rook Operator with Ceph cluster working on that on-prem cluster
  • PostgreSQL database using persistence provided by Ceph cluster,

then the steps would need to look something like this:

  • CLI command to install PostgreSQL
  • It should check pre-requisites and throw an error that it cannot be installed because the persistence layer is missing
  • CLI command to search for a persistence layer
  • It would provide some possibilities
  • CLI command to install rook
  • It should check pre-requisites and throw an error that it cannot be installed because the Kubernetes cluster is missing
  • CLI command to search for a Kubernetes cluster
  • It would provide some possibilities
  • CLI command to install on-prem Kubernetes
  • It should perform the whole installation process
  • CLI command to install rook
  • It should perform the whole installation process
  • CLI command to install PostgreSQL
  • It should perform the whole installation process

Terraform providers

This approach would mean the following:

  • We reuse most of the terraform providers to provide infrastructure
  • We reuse the Kubernetes provider to deliver Kubernetes resources
  • We provide “operator” applications to wrap the ansible parts in a terraform-provider-consumable API (???)
  • A separate wrapper application to instantiate the “operator” applications and execute terraform

All that means that if we would like to install the following stack:

  • On-prem Kubernetes cluster
  • Rook Operator with Ceph cluster working on that on-prem cluster
  • PostgreSQL database using persistence provided by Ceph cluster,

then the steps would need to look something like this:

  • Prepare terraform configuration setting up infrastructure containing 3 required elements
  • CLI command to execute that configuration
  • It would need to detect that there is an on-prem cluster provider with nothing to connect to, and that it needs to instantiate the “operator” container
  • It instantiates the “operator” container and exposes its API
  • It executes the terraform script
  • It terminates the “operator” container

Kubernetes operators

This approach would mean the following:

  • To run anything, we need a Kubernetes cluster of any kind (a local Minikube is good as well)
  • We provide Kubernetes CR’s to operate components
  • We would reuse some existing operators
  • We would need to create some operators on our own
  • There would need to be a separate mechanism to create “on-prem” Kubernetes clusters (it might be an operator too)

All that means that if we would like to install the following stack:

  • On-prem Kubernetes cluster
  • Rook Operator with Ceph cluster working on that on-prem cluster
  • PostgreSQL database using persistence provided by Ceph cluster,

then the steps would need to look something like this:

  • Start Minikube instance on local node
  • Provide CRD of on-prem Kubernetes operator
  • Deploy on-prem Kubernetes operator
  • Wait until new cluster is deployed
  • Connect to it
  • Deploy rook operator definition
  • Deploy PostgreSQL operator definition

Comparison

The comparison below goes question by question. Answers are given for Dockerized custom modules (DCM), Terraform providers (TP) and Kubernetes operators (KO), together with the estimated effort size.

How much work does it require to package lambdastack into the first module?

  • DCM: Customize the entrypoint of the current image to provide metadata information. Size: 3XL
  • TP: Implement an API server in the current image to expose it to TP. Size: too big to handle; we would need to implement only new modules that way.
  • KO: Implement an ansible operator to handle CR’s and (possibly?) run the current image as tasks. Size: 5XL

How much work does it require to package the CNS module?

  • DCM: From the kubectl image, provide some parameters, provide CRD’s, provide CR’s. Size: XXL
  • TP: Use (possibly?) terraform-provider-kubernetes. Prepare CRD’s, prepare CR’s. Size: XL
  • KO: No operator required. Just deploy Rook CRD’s, operator, CR’s. Size: XL

How much work does it require to package the AKS/EKS module?

  • DCM: From terraform, provide some parameters, provide terraform scripts. Size: XL
  • TP: Prepare terraform scripts. Size: L
  • KO: No operator required. Use (possibly?) the rancher terraform-controller operator, provide a DO module with terraform scripts. [There is something called rancher/terraform-controller and it tries to be what we need. It’s alpha.] Size: XXL

How would dependencies be handled?

  • DCM: Not defined so far. It seems that a kind of “selectors” would be used to check if modules are installed and in state “applied” or something like this. Size: XXL
  • TP: Standard terraform dependencies tree. It’s worth remembering that terraform dependencies sometimes work in weird ways and changing one value can touch multiple places. We would need to assess how many dependencies there should be. Size: XL
  • KO: It seems that embedding all Kubernetes resources into helm charts and adding dependencies between them could solve the problem. Size: XXL

Would it be possible to install the CNS module on LambdaStack Kubernetes in version 0.4.4?

  • DCM: yes
  • TP: yes
  • KO: yes

If I want to install CNS, how would dependencies be provided?

  • DCM: By the selectors mechanism (that is a proposition)
  • TP: By the terraform tree
  • KO: By helm dependencies

Let’s assume that in version 0.8.0 of LambdaStack PostgreSQL is migrated to a new component (managed outside the lambdastack config). How would migration from 0.7.0 to 0.8.0 on existing environments be processed?

  • DCM: The proposition is that for this kind of operation we can create a separate “image” to conduct just that upgrade operation, for example ls-o0-08-upgrader. It would check that the v0.7.x environment had PostgreSQL installed, generate the config for the new PostgreSQL module, initialize that module and allow the upgrade of the lambdastack module to v0.8.x. Size: XXL
  • TP: It doesn’t look like there is a way to do it automatically with terraform. You would need to add a new PostgreSQL terraform configuration and import the existing state into it, then remove the PostgreSQL configuration from the old module (while preventing it from deleting resources). Even for an advanced terraform user it might be tricky, and it’s not certain we could handle it for the user. Size: unknown
  • KO: We would need to implement the whole functionality in the operator. Basically very similar to the DCM scenario, but triggered by a CR change. Size: 3XL

Where would a module store its configuration?

  • DCM: Locally in the ~/.e/ directory. In the future we can implement remote state (like a terraform remote backend).
  • TP: All terraform options.
  • KO: As a Kubernetes CR.

How would the status of components be gathered by a module?

  • DCM: We would need to implement it. Size: XL
  • TP: Standard terraform output and datasource mechanisms. Size: XS
  • KO: Status is continuously updated by the operator in the CR, so it is already there. Size: S

How would modules pass variables between each other?

  • DCM: The CLI wrapper should be aware that one module needs another module’s output; it should call module1 get-output and pass that json (or part of it) to module2 apply. Size: XXL
  • TP: Standard terraform. Size: XS
  • KO: Probably by Config resources, but not defined. Size: XL

How would an upstream module notify downstream modules that something changed in its values?

  • DCM: We would need to implement it. Size: XXL
  • TP: Standard terraform tree update. Overly active changes in the tree should be considered here, as with dependencies. Size: XL
  • KO: It’s not clear. If the upstream module can change the downstream Config resource (which seems to be a ridiculous idea) then it’s simple. The other way is that downstream periodically checks the upstream Config for changes, but that introduces problems if we use existing operators. Size: XXL

Sizes summary:

  • DCM: 1 3XL, 5 XXL, 2 XL
  • TP: 1 too big, 1 unknown, 3 XL, 1 L, 2 XS
  • KO: 1 5XL, 1 3XL, 3 XXL, 2 XL, 1 S

Conclusions

Strategic POV

DCM and KO are interesting. TP is too strict and not elastic enough.

Tactic POV

DCM has the smallest standard deviation when you look at task sizes, which indicates the smallest risk. TP is on the opposite side of the list with the biggest estimates and some significant unknowns.

Fast gains

If we were to consider only cloud-provided resources, TP would be the fastest way. Since we need to provide multiple different resources and work on-prem, it is not that attractive. The KO approach looks interesting, but it might be hard at the beginning. DCM looks like the simplest to implement with backward compatibility.

Risks

DCM has a significant risk of “custom development”. KO has risks related to the requirement to use the operator-framework and its concepts from the very beginning of the lsc work. TP has huge risks related to on-prem operational overhead.

Final thoughts

Risks related to DCM are the smallest and the learning curve looks best. We would also be able to stay backward compatible in a relatively simple way.

DCM looks like the desired approach.

16.6 -

Offline modes in modularised LambdaStack

Context

Due to the ongoing modularization process and the introduction of middleware modules we need to decide how modules would obtain the dependencies required for “offline” mode.

This document describes the installation and upgrade modes and discusses the ways to implement the whole process that were considered during the design phase.

Assumptions

Each module has access to the “/shared” directory. The preferred way to use modules is via the “e” command line app.

Installation modes

There are 2 main identified ways (each with 2 variants) to install a LambdaStack cluster.

  • Online - one machine in the cluster has access to the public internet. We would call this machine the repository machine, and that scenario would be named "Jump Host". A specific scenario in this group is when all machines have access to the internet. We are not really interested in that case separately because in all scenarios we want all cluster machines to download the required elements from the repository machine. We would call this scenario "Full Online".
  • Offline - none of the machines in the cluster have access to the public internet. There are also two versions of this scenario. The first version assumes that the installation process is initialized on the operator's machine (i.e. his/her laptop). We would call this scenario "Bastion v1". The second scenario is when the whole installation initialization process is executed directly from the "Downloading Machine". We would call that scenario "Bastion v2".

The following diagrams present a high-level overview of those 4 scenarios:

Jump Host

Full Online

Bastion v1

Bastion v2

Key machines

The scenarios described in the previous section show that there are a couple of machine roles identified in the installation process. The following list explains those roles in more detail.

  • Repository - the key role in the whole lifecycle process. This is the central cluster machine containing all the dependencies, providing the image repository for the cluster, etc.
  • Cluster machine - a common cluster member providing computational resources to the middleware being installed on it. This machine has to be able to reach the Repository machine.
  • Downloading machine - a temporary machine required to download OS packages for the cluster. This is a known process in which we download OS packages on a machine with access to the public internet and then transfer those packages to the Repository machine, where they are accessible to all the cluster machines.
  • Laptop - the terminal machine for a human operator to work on. There is no formal requirement for this machine to exist or be part of the process. All operations performed on that machine could be performed on the Repository or Downloading machine.

Downloading

This section describes the identified ways to provide dependencies to the cluster. There are 6 of them, each described in the following subsections with pros and cons.

Option 1

The Docker image of each module has all required binaries embedded in it during the build process.

Pros

  • There is no “download requirements” step.
  • Each module has all requirements ensured on build stage.

Cons

  • Module image is heavy.
  • Possible licensing issues.
  • Unknown versions of OS packages.

Option 2

There is a separate docker image with all required binaries for all modules embedded in it during the build process.

Pros

  • There is no “download requirements” step.
  • All requirements are stored in one image.

Cons

  • Image would be extremely large.
  • Possible licensing issues.
  • Unknown versions of OS packages.

Option 3

There is a separate “dependencies” image for each module containing just the dependencies.

Pros

  • There is no “download requirements” step.
  • Module image itself is still relatively small.
  • Requirements are ensured on build stage.

Cons

  • “Dependencies” image is heavy.
  • Possible licensing issues.
  • Unknown versions of OS packages.

Option 4

Each module has “download requirements” step and downloads requirements to some directory.

Pros

  • Module is responsible for downloading its requirements on its own.
  • Already existing “export/import” CLI feature would be enough.

Cons

  • Offline upgrade process might be hard.
  • Each module would perform the download process a bit differently.

Option 5

Each module has “download requirements” step and downloads requirements to docker named volume.

Pros

  • Module is responsible for downloading its requirements on its own.
  • Generic docker volume practices could be used.

Cons

  • Offline upgrade process might be hard.
  • Each module would perform the download process a bit differently.

Option 6

Each module contains a “requirements” section in its configuration, but there is one single module downloading the requirements for all modules.

Pros

  • The module is responsible for creating its BOM (bill of materials) and a single “downloader” container satisfies the needs of all the modules.
  • Centralised downloading process.
  • Manageable offline installation process.

Cons

  • Yet another “module”

Options discussion

  • Options 1, 2 and 3 are probably not viable due to the licenses of components and the possibly big, or even huge, size of the produced images.
  • The main issue with options 1, 2 and 3 is that they would only work for containers and binaries but not OS packages, as these depend on the targeted OS version and installation. This is something we cannot foresee or bundle for.
  • Options 4 and 5 would possibly introduce a bit of a mess related to each module managing downloads on its own. Also, the upgrade process in offline mode might be problematic due to the burden of providing new versions for each module separately.
  • Option 6 sounds like the most flexible one (see the sketch below).
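
A hedged sketch of what such a “requirements” section could look like; the keys and entries below are illustrative and not a final schema:

kind: configuration/example-module
title: Example module
name: default
specification:
  requirements:
    images:
      - rabbitmq:3.8.9
    binaries:
      - url: https://example.com/haproxy_exporter-0.12.0.linux-x86_64.tar.gz
        sha256: "<checksum>"
    os-packages:
      - haproxy

The single “downloader” module would merge such sections from all configured modules into one BOM and fetch everything in one pass.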

Export

It is visible in the offline scenarios that the "export" process is as important as the "download" process. For offline scenarios "export" has to cover the following elements:

  • downloaded images
  • downloaded binaries
  • downloaded OS packages
  • defined modules images
  • e command line app
  • e environment configuration

All those elements have to be packaged into an archive to be transferred to the cluster's Repository machine.

Import

After all elements are packaged and transferred to the Repository machine, they have to be imported into the Repository. The current assumption is that the repository module would be responsible for the import operation.

Summary

In this document we provide a high-level definition of how to approach offline installation and upgrade. The current understanding is:

  • each module provides a list of its requirements
  • a separate module collects those lists and downloads the required elements
  • the same module exports all artefacts into an archive
  • after the archive is transferred, the repository module imports its content

17 - Offline Upgrade

Design docs for Offline Upgrade

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack offline upgrade design document

Affected version: 0.4.x

Goals

Provide upgrade functionality for LambdaStack so Kubernetes and other components can be upgraded when working offline.

Use cases

LambdaStack should be upgradeable when there is no internet connection. It requires all packages and dependencies to be downloaded on a machine that has an internet connection and then moved to the air-gapped server.

Example use

lsupgrade -b /path/to/build/dir

Where -b is the path to the build folder that contains the Ansible inventory.

Design proposal

The MVP of the upgrade function will contain the Kubernetes upgrade procedure to the latest supported version of Kubernetes. Later it will be extended to all other LambdaStack components.

LambdaStack offline upgrade app

The lsupgrade application or module takes the build path location (the directory path that contains the Ansible inventory file).

The first part of the upgrade execution is to download/upload packages to the repository so that new packages exist and are ready for the upgrade process. When the repository module finishes its work, the upgrade Ansible playbooks will be executed.

The upgrade application/module shall implement the following functions:

  • [MVP] apply - executes the upgrade
  • --plan - makes no changes to the cluster; it returns the list of changes that would be made during upgrade execution (see the usage sketch below).
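
Hypothetical usage once both functions are in place (the exact flag spelling may differ in the final CLI):

# list the changes that would be made, without touching the cluster
lsupgrade --plan -b /path/to/build/dir

# execute the upgrade
lsupgrade -b /path/to/build/dir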

18 - Persistence Storage

Design docs for Persistence Storage

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

Intent

The aim of this document is to initialize the evaluation of possible persistence layers for a Kubernetes cluster (a.k.a. Cloud Native Storage, CNS) in various setups.

Conditions

There is a need to provide a persistence layer for the Kubernetes cluster installed as the LambdaStack container orchestrator. We need to consider the performance of the persistence layer as well. There is a possibility to utilize external persistence solutions in the future with managed Kubernetes cluster installations, but that is out of scope of this document.

OKR

This section proposes Objectives and Key Results for CNS.

  1. O1: Introduce Cloud Native Storage
    1. O1KR1: Have stable CNS released
    2. O1KR2: Have CNS performance tests automation
    3. O1KR3: Have CNS performance tests results

Possible solutions

As of now the following solutions are considered:

  • Ceph managed by Rook Operator
  • GlusterFS (managed by Heketi or Kadalu, but that would need further assessment)

We should review more solutions presented here.

There are numerous other solutions possible to use over CSI, but they require separate management.

Requirements

  • It has to be able to work on-premise
  • It has to be able to work offline
  • The difference in performance of middleware components needs to be known
  • The storage layer should be tightly integrated with Kubernetes
  • As much automation as possible is required (zero-management)

Tests

  • We need to have performance tests automated
  • Tests have to be executed daily
  • We should have PostgreSQL database performance tests automated
  • We should have kafka performance tests automated

Initial Plan

  1. Have LambdaStack cluster with PostgreSQL database
  2. Create a performance test running in a Kubernetes pod using PostgreSQL in the current setup (pgbench can be used; see the sketch after this list)
  3. Deploy rook operator and create Ceph cluster
  4. Create PostgreSQL database running in Kubernetes pod using Ceph PVC
  5. Run performance test using Kubernetes PostgreSQL instance
  6. Compare results
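
A minimal sketch of step 2, assuming a PostgreSQL service reachable inside the cluster as postgresql.default.svc with user test and database dbdb (all placeholders to adjust):

# initialize pgbench tables, then run a 60 second benchmark with 10 clients
kubectl run pgbench --rm -it --restart=Never --image=postgres:10.1 \
  --env="PGPASSWORD=test" -- \
  bash -c "pgbench -i -s 10 -h postgresql.default.svc -U test dbdb && \
           pgbench -c 10 -j 2 -T 60 -h postgresql.default.svc -U test dbdb"

Running the same command against a PostgreSQL instance backed by a Ceph PVC (steps 4-5) gives directly comparable results.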

19 - PostgreSQL

Design docs for PostgreSQL

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

LambdaStack database connection design document

Affected version: 0.5.x

1. Introduction

Deploying PostgreSQL in a high-demand environment requires reliability and scalability. Even if you don't scale your infrastructure and work on only one database node, at some point you will reach the connection limit. The number of connections to a Postgres database is limited and is defined by the max_connections parameter. It's possible to extend this limit, but you shouldn't do that recklessly - it depends on the machine resources.

2. Use case

LambdaStack delivers a solution to build a master-slave database node configuration. This means that the application by default connects to the master database. The database replica is updated immediately when the master is modified.

3. Assumptions

  • Database replica is read only
  • Write data only to Control Plane Node
  • Select operations on replica
  • There is no Pgpool-II software available for Ubuntu 18.04 - not officially supported

4. Design proposal

4.1. Minimal configuration

The minimal solution to meet the client requirements is to install Pgbouncer on the database master node to maintain a connection pool. This will partially solve the problem of exceeded connection limits. All applications need to be reconfigured to connect not directly to the database, but to the Pgbouncer service, which will redirect connections to the database master. This solution can be delivered fast and is quite easy to implement.
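
A minimal Pgbouncer configuration sketch for that setup; the database name, auth file and pool sizes are placeholders that have to be tuned per environment:

[databases]
; every client connecting to port 6432 is routed to the local master
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20

Applications then point their connection strings at port 6432 of the master node instead of 5432.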

4.2. High Availability configuration

High Availability Cluster

The above chart presents a high availability database cluster. Pgbouncer and Pgpool are located in separate pods in the Kubernetes cluster. Pgbouncer maintains the connection pool and redirects connections to Pgpool, which is responsible for pooling and distributing connections between the master and slave nodes. This allows redirecting write operations to the master database node and read (select) operations to the slave database node(s). Additionally, repmgr takes care of database availability (it must be installed on every database node) and promotes a subsequent slave node to master when the previous master goes down.

LambdaStack PostgreSQL auditing design document

Affected version: 0.5.x

Goals

Provide functionality to perform auditing of operations performed on PostgreSQL.

Use cases

For SOX and other regulatory compliance, the platform should provide an auditing function for the PostgreSQL database. This should be set via LambdaStack automation in the LambdaStack configuration yaml.

Example use

In the configuration for PostgreSQL we can add additional parameters that configure additional properties of PostgreSQL. A config similar to the one proposed below can be used to configure auditing using pgaudit.

kind: configuration/postgresql
title: PostgreSQL
name: default
specification:
  ...
  extensions:
    pgaudit:
      enabled: false
      shared_preload_libraries:
        - pgaudit
      config_file_parameters:
        pgaudit.log: 'all, -misc'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: "'%m [%p] %q%u@%d,host=%h '"
        log_statement: 'none'
  ...

Design proposal

Add additional settings to the PostgreSQL configuration that would install and configure the pgaudit extension. For RHEL we use PostgreSQL installed from the Software Collections repository, which doesn't provide a pgaudit package for PostgreSQL versions older than 12. For this reason, on RHEL pgaudit will be installed from the PostgreSQL repository.
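
A quick way to verify the result on a database node after the automation has run (a sketch; the expected values follow from the example configuration above):

sudo -u postgres psql -c "SHOW shared_preload_libraries;"   # should contain pgaudit
sudo -u postgres psql -c "SHOW pgaudit.log;"                # should return: all, -misc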

20 - Ceph (Rook)

Design docs for Ceph

Some of these date back to older versions but efforts are made to keep the most important - sometimes :)

Replication / configuration

Configuration data is stored in: /var/lib/ceph. Storage data is located on dedicated devices which are connected via OSD pods.

Replication: Like Ceph Clients, Ceph OSD Daemons use the CRUSH algorithm, but the Ceph OSD Daemon uses it to compute where replicas of objects should be stored (and for rebalancing). In a typical write scenario, a client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, then looks at the CRUSH map to identify the primary OSD for the placement group. The client writes the object to the identified placement group in the primary OSD. Then, the primary OSD with its own copy of the CRUSH map identifies the secondary and tertiary OSDs for replication purposes, and replicates the object to the appropriate placement groups in the secondary and tertiary OSDs (as many OSDs as additional replicas), and responds to the client once it has confirmed the object was stored successfully.

Prerequisite

Since version 1.4 the lvm package is required to be present on the nodes. This applies to AWS machines (not tested on Ubuntu). Example installation command:

RHEL:

yum install lvm2 -y

Rook ceph design

https://rook.io/docs/rook/v1.4/ceph-storage.html

Cluster setup

A Rook Ceph cluster can be easily deployed using the example default definitions from the GH repo:

git clone --single-branch --branch release-1.4 https://github.com/rook/rook.git

open location:

rook/cluster/examples/kubernetes/ceph

and list examples:

    -rw-r--r--. 1 root root 395 Jul 28 13:00 ceph-client.yaml
    -rw-r--r--. 1 root root 1061 Jul 28 13:00 cluster-external-management.yaml
    -rw-r--r--. 1 root root 886 Jul 28 13:00 cluster-external.yaml
    -rw-r--r--. 1 root root 5300 Jul 28 13:00 cluster-on-pvc.yaml
    -rw-r--r--. 1 root root 1144 Jul 28 13:00 cluster-test.yaml
    -rw-r--r--. 1 root root 10222 Jul 28 14:47 cluster.yaml
    -rw-r--r--. 1 root root 2143 Jul 28 13:00 common-external.yaml
    -rw-r--r--. 1 root root 44855 Jul 28 13:00 common.yaml
    -rw-r--r--. 1 root root 31424 Jul 28 13:00 create-external-cluster-resources.py
    -rw-r--r--. 1 root root 2641 Jul 28 13:00 create-external-cluster-resources.sh
    drwxr-xr-x. 5 root root 47 Jul 28 13:00 csi
    -rw-r--r--. 1 root root 363 Jul 28 13:00 dashboard-external-https.yaml
    -rw-r--r--. 1 root root 362 Jul 28 13:00 dashboard-external-http.yaml
    -rw-r--r--. 1 root root 839 Jul 28 13:00 dashboard-ingress-https.yaml
    -rw-r--r--. 1 root root 365 Jul 28 13:00 dashboard-loadbalancer.yaml
    -rw-r--r--. 1 root root 1554 Jul 28 13:00 direct-mount.yaml
    -rw-r--r--. 1 root root 3308 Jul 28 13:00 filesystem-ec.yaml
    -rw-r--r--. 1 root root 780 Jul 28 13:00 filesystem-test.yaml
    -rw-r--r--. 1 root root 4286 Jul 28 13:00 filesystem.yaml
    drwxr-xr-x. 2 root root 115 Jul 28 13:00 flex
    -rw-r--r--. 1 root root 4530 Jul 28 13:00 import-external-cluster.sh
    drwxr-xr-x. 2 root root 183 Jul 28 13:00 monitoring
    -rw-r--r--. 1 root root 1409 Jul 28 13:00 nfs.yaml
    -rw-r--r--. 1 root root 495 Jul 28 13:00 object-bucket-claim-delete.yaml
    -rw-r--r--. 1 root root 495 Jul 28 13:00 object-bucket-claim-retain.yaml
    -rw-r--r--. 1 root root 2306 Jul 28 13:00 object-ec.yaml
    -rw-r--r--. 1 root root 2313 Jul 28 13:00 object-openshift.yaml
    -rw-r--r--. 1 root root 698 Jul 28 13:00 object-test.yaml
    -rw-r--r--. 1 root root 488 Jul 28 13:00 object-user.yaml
    -rw-r--r--. 1 root root 3573 Jul 28 13:00 object.yaml
    -rw-r--r--. 1 root root 19075 Jul 28 13:00 operator-openshift.yaml
    -rw-r--r--. 1 root root 18199 Jul 28 13:00 operator.yaml
    -rw-r--r--. 1 root root 1080 Jul 28 13:00 pool-ec.yaml
    -rw-r--r--. 1 root root 508 Jul 28 13:00 pool-test.yaml
    -rw-r--r--. 1 root root 1966 Jul 28 13:00 pool.yaml
    -rw-r--r--. 1 root root 410 Jul 28 13:00 rgw-external.yaml
    -rw-r--r--. 1 root root 2273 Jul 28 13:00 scc.yaml
    -rw-r--r--. 1 root root 682 Jul 28 13:00 storageclass-bucket-delete.yaml
    -rw-r--r--. 1 root root 810 Jul 28 13:00 storageclass-bucket-retain-external.yaml
    -rw-r--r--. 1 root root 681 Jul 28 13:00 storageclass-bucket-retain.yaml
    -rw-r--r--. 1 root root 1251 Jul 28 13:00 toolbox.yaml
    -rw-r--r--. 1 root root 6089 Jul 28 13:00 upgrade-from-v1.2-apply.yaml
    -rw-r--r--. 1 root root 14957 Jul 28 13:00 upgrade-from-v1.2-crds.yaml

After creating the basic setup (common.yaml, operator.yaml, cluster.yaml), install the toolbox (toolbox.yaml) as well for checking the Ceph cluster status.

IMPORTANT:

Ensure the OSD container is created and running. It requires a storage device to be available on the nodes.

During cluster startup it searches for the available devices and creates OSD containers for them.

Kubelet on the nodes has to use the default flag enable-controller-attach-detach set to true. Otherwise the PVC will not attach to the pod.

Location of the file where we can find the flag:

/var/lib/kubelet/kubeadm-flags.env 

on every worker node with kubelet. After changing the flag we need to restart kubelet:

systemctl restart kubelet

If the cluster is working we can create storage, which can be one of the following types:

Block: Create block storage to be consumed by a pod
Object: Create an object store that is accessible inside or outside the Kubernetes cluster
Shared Filesystem: Create a filesystem to be shared across multiple pods

Eg.

-> filesystem.yaml and then

-> storageclass.yaml
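
For the Block type, the pool and storage class consumed by the PVC example later in this section can be defined roughly as follows (the pattern follows the Rook v1.4 CSI examples; pool name, replica count and filesystem are illustrative):

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete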

CRD:

There are 2 ways a cluster can be set up:

  • Host-based Cluster
  • PVC-based Cluster

PVC example:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block

Application using PVC example:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgresql
  namespace: default
  labels:
    k8s-app: postgresql
    kubernetes.io/cluster-service: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: postgresql
  template:
    metadata:
      labels:
        k8s-app: postgresql
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
        - name: postgres
          image: postgres:10.1
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: dbdb
            - name: POSTGRES_USER
              value: test
            - name: POSTGRES_PASSWORD
              value: test
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - mountPath: "/var/lib/postgresql/data"
              name: "image-store"
      volumes:
      - name: image-store
        persistentVolumeClaim:
          claimName: rbd-pvc
          readOnly: false

When choosing Block Storage, which allows a single pod to mount the storage, be aware that if the node where your application is hosted crashes, all the pods located on that node will go into Terminating state and the application will be unavailable, since the terminating pods block access to the ReadWriteOnce volume and a new pod can't be created. You have to manually delete the volume attachment (see the sketch below) or use CephFS instead of RBD.
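
A sketch of the manual workaround after a node crash (object names are placeholders):

# find the attachment that still references the volume on the crashed node
kubectl get volumeattachments

# remove it so the volume can be attached to the replacement pod's node
kubectl delete volumeattachment <attachment-name>

# optionally force-remove the pod stuck in Terminating state
kubectl delete pod <pod-name> --grace-period=0 --force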

Related discussion: https://stackoverflow.com/questions/61186199/why-does-kubernetes-not-terminating-pods-after-a-node-crash

Internal k8s automated setup and tests

Step by step procedure for setting environment up and testing it (together with backup/restore) is available in the following repo: https://github.com/mkyc/k8s-rook-ceph

Good starting point:

https://rook.io/docs/rook/v1.4/ceph-quickstart.html

Toolbox for debugging:

https://rook.io/docs/rook/v1.4/ceph-toolbox.html

Filesystem storage:

https://rook.io/docs/rook/v1.4/ceph-filesystem.html

Custom Resource Definitions:

https://rook.io/docs/rook/v1.4/ceph-cluster-crd.html

Add/remove osd nodes: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/administration_guide/adding_and_removing_osd_nodes

Useful rook ceph guide: https://www.cloudops.com/2019/05/the-ultimate-rook-and-ceph-survival-guide/