E-TF1’s Containerized Infrastructure Optimization Strategies for Enhanced Performance

Keeping Resources Internal or Moving to Open Source: A Strategic Dilemma for e-TF1

The teams at e-TF1 have recently confronted a significant strategic question: should they retain certain tools and processes within their organization or transition them into open source projects? This debate arose specifically around one of the tools they developed to automate environment shutdowns during low-traffic periods, a critical component of their cloud resource management strategy.

Automating Cloud Resource Shutdowns at the Cluster Level

The tool in question automatically deactivates non-production environments during off-peak hours, when they are not in use. It operates at the level of individual Kubernetes clusters, each of which is configured to manage its lifecycle autonomously. Before implementing this functionality themselves, e-TF1 considered two open source solutions.

First, Kube-green, a tool primarily designed to pause or scale down deployment replicas and scheduled tasks (cron jobs). It functions by storing the previous state of resources in secrets, allowing a quick restoration when necessary. Second, Kubecost Cluster Turndown, which interacts with the scaling process of cluster nodes through Custom Resource Definitions (CRDs). Both approaches aim to reduce costs by halting unnecessary workloads.
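
To make the comparison concrete, here is a minimal sketch, not taken from e-TF1's setup, of how a Kube-green schedule is declared: a SleepInfo custom resource created with the official Kubernetes Python client. The namespace and times are placeholders; the field names follow the kube-green documentation.

```python
# Hypothetical sketch: declaring a Kube-green SleepInfo resource with the
# official Kubernetes Python client. The namespace and schedule are invented.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

sleep_info = {
    "apiVersion": "kube-green.com/v1alpha1",
    "kind": "SleepInfo",
    "metadata": {"name": "nightly-shutdown"},
    "spec": {
        "weekdays": "1-5",          # Monday to Friday
        "sleepAt": "20:00",         # scale deployments down in the evening
        "wakeUpAt": "08:00",        # restore the saved replica counts
        "timeZone": "Europe/Paris",
        "suspendCronJobs": True,    # also pause scheduled jobs
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kube-green.com",
    version="v1alpha1",
    namespace="staging",            # hypothetical non-production namespace
    plural="sleepinfos",
    body=sleep_info,
)
```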

Limitations and Compatibility Challenges

However, each of these open source options has significant limitations. Kube-green only acts on the resources its operator supports, essentially Deployment replicas and scheduled jobs; it cannot manage StatefulSets, which are often used for databases and other stateful applications. Moreover, Kube-green can conflict with GitOps practices: a GitOps controller that continuously reconciles the cluster against the declared configuration may see the scaled-down workloads as drift and restore them.

Meanwhile, Kubecost Cluster Turndown does not integrate with Karpenter, e-TF1's chosen autoscaling solution. Karpenter dynamically provisions compute directly in response to workload demands rather than through node groups, offering a more flexible alternative to the traditional Cluster Autoscaler, and Cluster Turndown's node-group-oriented model does not fit that architecture. Implementing it could therefore have led to operational inconsistencies.

Development of an Internal Solution

Given these constraints, e-TF1's engineering team developed an in-house tool designed specifically to manage Karpenter's node pools. The tool backs up the node pools managed by Karpenter and then deletes them, while keeping operations smooth and avoiding unnecessary alerts and notifications: before a node pool is shut down, temporary silences are created in Alertmanager so that the planned downtime does not generate alert noise. When a node pool is deleted, its underlying EC2 instances are shut down gracefully.
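
A heavily simplified sketch of that shutdown sequence is shown below, assuming the Python Kubernetes client, the Alertmanager v2 API, and a cluster-scoped Karpenter NodePool resource; the Alertmanager URL, the names, and the Karpenter API version are illustrative, not e-TF1's actual code.

```python
# Minimal sketch of the shutdown sequence described above, not e-TF1's tool:
# silence Alertmanager, then delete a Karpenter NodePool so its EC2 instances
# are drained and terminated. URL, names and API version are assumptions.
from datetime import datetime, timedelta, timezone

import requests
from kubernetes import client, config

ALERTMANAGER_URL = "http://alertmanager.monitoring:9093"  # hypothetical address

def silence_alerts(cluster: str, hours: int = 12) -> None:
    """Create a temporary Alertmanager silence covering the planned shutdown."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "env-shutdown",
        "comment": "Planned off-hours shutdown of a non-production environment",
    }
    resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=10)
    resp.raise_for_status()

def delete_nodepool(name: str) -> None:
    """Delete a Karpenter NodePool; Karpenter then drains and removes its nodes."""
    client.CustomObjectsApi().delete_cluster_custom_object(
        group="karpenter.sh",
        version="v1",          # assumed; older installations use v1beta1
        plural="nodepools",
        name=name,
    )

if __name__ == "__main__":
    config.load_kube_config()
    silence_alerts(cluster="staging")       # hypothetical cluster label
    delete_nodepool("general-purpose")      # hypothetical node pool name
```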

This internal approach adds a layer of complexity: it introduces a temporary divergence from Terraform, the team's Infrastructure as Code (IaC) tool of choice, and it requires ongoing maintenance. Nonetheless, it provides precise control tailored to e-TF1's specific infrastructure setup.

Considering Future Integration of Open Source Tools

In the long term, the team recognizes the advantages of moving these functionalities into open source solutions to reduce maintenance overhead. That could mean consolidating their features within community-supported tools, streamlining workflows and improving compatibility with cloud autoscaling components.

Autoscaling with Karpenter and KEDA

In parallel with managing environment shutdowns, e-TF1 has modernized its scaling infrastructure. The team replaced the traditional Cluster Autoscaler with Karpenter, a more flexible and efficient autoscaling engine. For scaling individual applications, they adopted KEDA (Kubernetes Event-Driven Autoscaler) in place of their direct use of the Horizontal Pod Autoscaler (HPA).

KEDA makes it possible to scale on multiple triggers, such as Prometheus metrics, SQS queues, or Kafka topics, enabling fine-tuned adjustments driven by real-time business signals. It is configured through a custom resource called ScaledObject, which specifies the scaling policy for each workload.
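
As an illustration of the mechanism, the sketch below declares a hypothetical ScaledObject driven by a Prometheus trigger; the target deployment, namespace, query, and thresholds are invented for the example and do not come from e-TF1.

```python
# Illustrative only: a KEDA ScaledObject driving a deployment from a Prometheus
# metric, created with the Kubernetes Python client. All values are placeholders.
from kubernetes import client, config

config.load_kube_config()

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "api-gateway-scaler"},
    "spec": {
        "scaleTargetRef": {"name": "api-gateway"},   # deployment to scale
        "minReplicaCount": 2,
        "maxReplicaCount": 50,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": 'sum(rate(http_requests_total{service="api-gateway"}[2m]))',
                    "threshold": "100",   # target requests/s per replica
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1",
    namespace="production", plural="scaledobjects",
    body=scaled_object,
)
```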

Proactive Overprovisioning and Load Management

To handle unexpected traffic spikes, e-TF1 decided to provision approximately 5% extra capacity beyond normal demand—known as overprovisioning. This is implemented through a Helm chart called Cluster Overprovisioner, developed by the German company codecentric. The tool combines two mechanisms:

– Kubernetes' Cluster Proportional Autoscaler (CPA), which scales the number of placeholder replicas in proportion to the cluster's size, measured in schedulable nodes and available CPU cores.

– A deployment of low-priority placeholder pods, which reserve capacity without taking it away from regular workloads. When higher-priority pods need resources, these placeholders are evicted, and the now-pending placeholders trigger additional node provisioning through the cluster's autoscaling layer, as sketched below.
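
In essence, the chart deploys "pause" pods under a negative-priority class, roughly as in the following hand-written sketch; the sizes, names, and namespace are invented for the example and are not e-TF1's values.

```python
# Rough illustration of what the codecentric cluster-overprovisioner sets up:
# a negative-priority class plus a deployment of "pause" pods that merely
# reserve CPU/memory. All names and sizes are invented.
from kubernetes import client, config

config.load_kube_config()

# 1. Priority class below the default (0), so any real workload can evict
#    these placeholder pods.
client.SchedulingV1Api().create_priority_class(
    body=client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="overprovisioning"),
        value=-1,
        global_default=False,
        description="Placeholder pods reserving spare capacity",
    )
)

# 2. Deployment of pause containers that hold the reserved capacity.
placeholder = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "overprovisioning"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "overprovisioning"}},
        "template": {
            "metadata": {"labels": {"app": "overprovisioning"}},
            "spec": {
                "priorityClassName": "overprovisioning",
                "containers": [{
                    "name": "reserve",
                    "image": "registry.k8s.io/pause:3.9",
                    "resources": {"requests": {"cpu": "1", "memory": "1Gi"}},
                }],
            },
        },
    },
}
client.AppsV1Api().create_namespaced_deployment(namespace="kube-system", body=placeholder)
```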

Node Pool Management and Targeted Taints

For efficient resource allocation, e-TF1 manages dedicated node groups with specific taints. For instance, there are general-purpose node pools and others optimized for disk-intensive workloads, ensuring workloads are scheduled on appropriate nodes. This stratification enhances performance and cost-efficiency.
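
On the workload side, this relies on standard Kubernetes scheduling: a pod destined for the disk-optimized pool carries a toleration for that pool's taint and a matching node selector. The sketch below uses an invented label/taint key, namespace, and image, purely for illustration.

```python
# Sketch of the workload side: a pod that tolerates the taint placed on a
# hypothetical disk-optimized node pool and selects that pool explicitly.
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "disk-heavy-ingest"},
    "spec": {
        # Only nodes carrying this label (set on the dedicated pool) are eligible.
        "nodeSelector": {"pool": "disk-optimized"},
        # Without this toleration the pod would be rejected by the pool's taint,
        # which is how general-purpose workloads are kept off these nodes.
        "tolerations": [{
            "key": "pool",
            "operator": "Equal",
            "value": "disk-optimized",
            "effect": "NoSchedule",
        }],
        "containers": [{
            "name": "ingest",
            "image": "registry.example.com/ingest:latest",  # placeholder image
            "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="media", body=pod)
```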

Addressing Resources Outside Kubernetes

Beyond autoscaling, e-TF1 also considers how to power down resources outside of the Kubernetes clusters. For non-critical AWS-managed data services such as RDS and ElastiCache, they favor single-AZ configurations when high availability isn't essential, in order to minimize costs. On Amazon S3, they clean up buckets whose data does not need to be preserved, trimming storage expenses. They are also exploring AWS Graviton instances, which promise approximately 10% savings along with better performance, as another avenue for cost reduction.
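
One way to power down such out-of-cluster resources, sketched below with boto3, is to stop non-critical RDS instances during off-hours; this is an illustration of the idea rather than e-TF1's documented approach, and the region and instance identifier are placeholders. Note that AWS automatically restarts stopped RDS instances after seven days.

```python
# Hedged sketch: stop a non-critical RDS instance outside business hours.
import boto3

rds = boto3.client("rds", region_name="eu-west-3")  # assumed region

def stop_if_available(instance_id: str) -> None:
    """Stop the instance only if it is currently running."""
    status = rds.describe_db_instances(DBInstanceIdentifier=instance_id)[
        "DBInstances"][0]["DBInstanceStatus"]
    if status == "available":
        rds.stop_db_instance(DBInstanceIdentifier=instance_id)

stop_if_available("staging-backend-db")  # hypothetical non-production database
```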

However, some resources require their own shutdown strategies. Load balancers created automatically by the clusters and infrastructure components provisioned through tools like Crossplane demand bespoke handling. The team is considering Terraform destroy/apply cycles, automating environment teardown followed by reconstruction, to achieve deeper resource decommissioning. Although effective, this approach lengthens recovery times: rebuilding an environment can take over 20 minutes for EKS or RDS, and up to 45 minutes when OpenSearch is involved.
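
Such a destroy/rebuild cycle could be driven by a simple scheduled script, along the lines of the sketch below; it assumes Terragrunt-managed environments and an invented directory layout, and is only meant to illustrate the sequencing, not e-TF1's actual automation.

```python
# Illustration of an automated teardown/rebuild cycle driven from a scheduler
# (cron, CI job, ...). Directory layout and Terragrunt usage are assumptions.
import subprocess
import sys

ENV_DIR = "environments/staging"  # hypothetical Terragrunt/Terraform directory

def run(*cmd: str) -> None:
    """Run a command in the environment directory, failing loudly on error."""
    subprocess.run(cmd, cwd=ENV_DIR, check=True)

def teardown() -> None:
    # Evening: destroy everything the environment's Terraform code manages,
    # including resources (load balancers, Crossplane-provisioned services)
    # that a cluster-level shutdown would not touch.
    run("terragrunt", "run-all", "destroy", "--terragrunt-non-interactive")

def rebuild() -> None:
    # Morning: re-apply the same code; expect tens of minutes for EKS/RDS,
    # longer when OpenSearch is involved.
    run("terragrunt", "run-all", "apply", "--terragrunt-non-interactive")

if __name__ == "__main__":
    teardown() if "--down" in sys.argv else rebuild()
```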

Tools and Infrastructure Management

e-TF1's infrastructure is deployed with Terraform combined with Terragrunt, which simplifies managing the Terraform codebase, alongside Atlantis, which automates Terraform plans and applies from pull requests as part of their CI/CD workflow. Their applications run in containers on EKS, providing a flexible, scalable environment. A small part of the infrastructure remains on-premises, notably an internal CDN used for video content delivery.

Conclusion

Faced with the complex challenge of balancing automation, cost-efficiency, and operational stability, e-TF1’s teams are actively exploring and developing various solutions. While open source tools offer promising capabilities, their limitations and compatibility issues compel the organization to build tailored internal solutions. As their infrastructure evolves, a hybrid approach—integrating proprietary tools with community projects—may offer the optimal path forward for managing resources efficiently in a dynamic cloud environment.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.