Autoscaling
Design docs for Autoscaling
Some of these date back to older versions, but efforts are made to keep the most important ones current - sometimes :)
LambdaStack Autoscaling
Affected version: 0.7.x
1. Goals
We want to provide automatic scale up / down feature for cloud-based LambdaStack clusters (currently Azure and AWS).
- Clusters will be resized in reaction to resource utilisation (CPU and Memory).
- Existing LambdaStack automation will be reused and optimized for the purpose of autoscaling.
- Additional nodes will be added (removed) to (from) running Kubernetes clusters.
- The Horizontal Pod Autoscaler will be used to control the number of pods for a particular deployment.
2. Design proposal
PHASE 1: Adding ability to scale-down the pool of worker nodes.
- The current LambdaStack codebase does not allow scaling down Kubernetes clusters in a proper way.
- This is crucial for autoscaling to work, as we need to properly drain nodes and delete them from Kubernetes before they are physically destroyed (see the sketch after this list).
- Also, this step needs to be performed before the Terraform code is executed (which requires a refactor of the lambdastack code).
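A minimal sketch of what such a scale-down step could look like, assuming the official `kubernetes` Python client is used (the actual implementation may live elsewhere in the lambdastack codebase); it cordons the node, evicts its pods and removes the Node object before Terraform destroys the VM:

```python
from kubernetes import client, config

def drain_and_delete_node(node_name: str) -> None:
    """Gracefully remove a worker node from the cluster before its VM is destroyed."""
    config.load_kube_config()  # or config.load_incluster_config() when run from a pod
    v1 = client.CoreV1Api()

    # 1. Cordon the node so no new pods are scheduled on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    # 2. Evict all pods running on the node (DaemonSet pods are skipped).
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}").items
    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        # Eviction respects PodDisruptionBudgets; older client versions name this class V1beta1Eviction.
        eviction = client.V1Eviction(metadata=client.V1ObjectMeta(
            name=pod.metadata.name, namespace=pod.metadata.namespace))
        v1.create_namespaced_pod_eviction(
            name=pod.metadata.name, namespace=pod.metadata.namespace, body=eviction)

    # 3. Delete the Node object; the underlying VM is removed later by Terraform.
    v1.delete_node(node_name)
```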
PHASE 2: Moving terraform's state and lambdastack-cluster-config to a shared place in the cloud.
- Currently LambdaStack keeps state files and cluster configs in the `build/xxx/` directories, which makes them hard to share.
- To solve the issue, Terraform backends can be used (e.g. the azurerm backend for Azure and the S3 backend for AWS).
- For simplicity, the same "bucket" can be used to store and share the lambdastack-cluster-config.
PHASE 3: Building Packer images to quickly add new Kubernetes nodes.
- Autoscaling is expected to react reasonably quickly. Providing pre-built images should result in great speed-ups.
- Packer code should be added to the lambdastack codebase somewhere "before" the terraform code executes.
PHASE 4: Realistic provisioning minimization and speed-up.
- Currently LambdaStack's automation takes a lot of time to provision clusters.
- Limits and tags can be used to filter out unnecessary plays from the Ansible execution (for now, narrowing it just to Kubernetes node provisioning), as sketched below.
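As an illustration, narrowing an Ansible run to just the worker-node provisioning could be done with the standard `--limit` and `--tags` options (the playbook, tag and group names below are hypothetical and depend on how the plays are actually tagged):

```python
import subprocess

# Run only the plays/tasks tagged for Kubernetes nodes, limited to the
# worker-node host group ("kubernetes_node" is a hypothetical tag/group name).
subprocess.run(
    ["ansible-playbook", "site.yml",
     "--limit", "kubernetes_node",
     "--tags", "kubernetes_node"],
    check=True,
)
```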
PHASE 5: Adding ability to authenticate and run lambdastack from a pod.
- To be able to execute lambdastack from a running LambdaStack cluster, it is required to deploy SSH keys and cloud access configuration (i.e. a Service Principal).
- SSH keys can be created and distributed automatically (in Ansible) just for the purpose of autoscaling.
- For now, it seems reasonable to store them in Kubernetes secrets (later HashiCorp Vault will be used); see the sketch after this list.
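For illustration, a minimal sketch of storing such an SSH key in a Kubernetes secret with the `kubernetes` Python client (the secret name, namespace and key file path are illustrative; in practice this could equally be done from Ansible):

```python
import base64
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Read the private key generated for autoscaling and base64-encode it,
# as required by the Secret "data" field.
with open("/path/to/autoscaler_id_rsa", "rb") as key_file:
    key_b64 = base64.b64encode(key_file.read()).decode()

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="lambdastack-autoscaler-ssh",
                                 namespace="kube-system"),
    type="Opaque",
    data={"id_rsa": key_b64},
)
v1.create_namespaced_secret(namespace="kube-system", body=secret)
```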
PHASE 6: Introducing a Python application that will execute lambdastack from a pod (in reaction to performance metrics) to scale the pool of worker nodes.
- Metrics can be obtained from metrics-server (see the sketch after this list).
- For simplicity, standard CPU / Memory metrics will be used, but later it should be possible to introduce custom metrics taken from Prometheus.
- The best way to package and deploy the application would be to use Helm (v3).
- The Docker image for the application can be stored in a public Docker registry.
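A rough sketch of such a scaling loop, assuming metrics-server is installed, the pod has RBAC access to the `metrics.k8s.io` API, and the CLI entry point is `lambdastack apply -f <config>` (thresholds, paths and the config-update step are illustrative):

```python
import subprocess
import time
from kubernetes import client, config

SCALE_UP_THRESHOLD = 0.80    # average CPU usage above which a worker is added
SCALE_DOWN_THRESHOLD = 0.30  # average CPU usage below which a worker is removed

def parse_cpu(value: str) -> float:
    """Convert Kubernetes CPU quantities ('250m', '4', '154468366n') to cores."""
    if value.endswith("n"):
        return int(value[:-1]) / 1e9
    if value.endswith("m"):
        return int(value[:-1]) / 1e3
    return float(value)

def average_cpu_usage() -> float:
    """Average CPU usage ratio across all nodes, read from metrics-server."""
    metrics = client.CustomObjectsApi().list_cluster_custom_object(
        "metrics.k8s.io", "v1beta1", "nodes")
    nodes = {n.metadata.name: n for n in client.CoreV1Api().list_node().items}
    ratios = []
    for item in metrics["items"]:
        node = nodes[item["metadata"]["name"]]
        used = parse_cpu(item["usage"]["cpu"])
        allocatable = parse_cpu(node.status.allocatable["cpu"])
        ratios.append(used / allocatable)
    return sum(ratios) / len(ratios)

def apply_worker_count(count: int) -> None:
    # Hypothetical step: update the worker-pool size in the shared
    # lambdastack-cluster-config (omitted here), then re-run the automation.
    subprocess.run(["lambdastack", "apply", "-f", "/shared/cluster-config.yml"],
                   check=True)

if __name__ == "__main__":
    config.load_incluster_config()  # the application runs inside the cluster
    workers = 3                     # illustrative starting pool size
    while True:
        cpu = average_cpu_usage()
        if cpu > SCALE_UP_THRESHOLD:
            workers += 1
            apply_worker_count(workers)
        elif cpu < SCALE_DOWN_THRESHOLD and workers > 1:
            workers -= 1
            apply_worker_count(workers)
        time.sleep(60)
```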
PHASE 7: Introducing standard Horizontal Pod Autoscaler to scale pods in LambdaStack clusters.
- To scale Kubernetes pods in LambdaStack clusters, the Horizontal Pod Autoscaler will be used.
- This step will be dependent on the user / customer (the user will deploy and configure the proper resources inside Kubernetes), as in the sketch below.
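For completeness, a minimal sketch of creating an `autoscaling/v1` HPA with the `kubernetes` Python client (the deployment name, namespace and limits are illustrative; in practice the user would typically define this in their own manifests or via `kubectl autoscale`):

```python
from kubernetes import client, config

config.load_kube_config()

# Keep the "web-app" Deployment between 2 and 10 replicas, targeting 80% CPU.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-app"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web-app"),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=80,
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```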