Kubernetes: The First 17 AI-Certified Platforms

Istio and Kueue for some, Traefik and Volcano for others; sometimes Kubeflow, sometimes KubeRay, or both… Kubernetes platform providers have taken a variety of paths to demonstrate compliance with the CNCF’s AI specification.

This spec defines a set of capabilities, APIs and configurations that a cluster certified as AI-conformant must offer in order to run AI/ML workloads reliably. The main objective: avoid fragmentation that would compromise workload portability.

A first round of self-certification with 9 mandatory elements

The work officially kicked off this summer. Since then, a v1 of the specification has been published and a first round of certification launched. More precisely, self-certification: the process is currently declarative. An automated test suite is expected to take over, but not before 2026.

Read also: Kubernetes: the CNCF projects most deployed in production

Many items listed in the specification are, at least for the moment, not mandatory. Among them:

  • Ensuring that compatible drivers and the corresponding runtime configurations are correctly installed and maintained on accelerator-equipped nodes
  • Facilitating the pulling of large container images (for example through replication or caching near the execution nodes)
  • Enabling unified management of sets of jobs via a mechanism implementing the JobSet API
  • Allowing the deployment of confidential containers in secure execution environments (hardware enclaves)
  • Providing a mechanism to detect accelerators in error, with potentially automated remediation
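As an illustration of the JobSet item, grouping several replicas of a training Job under a single lifecycle might look like the following. This is a minimal sketch, assuming the JobSet controller is installed in the cluster; the image name is hypothetical.

```yaml
# Sketch: a JobSet running four parallel trainer pods as one unit
# (assumes the JobSet controller is installed; image is hypothetical).
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-train
spec:
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: registry.example.com/train:latest  # hypothetical image
```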

Nine elements are currently mandatory. In broad terms:

  1. Support dynamic resource allocation (DRA)
  2. Support the Kubernetes Gateway API, with an implementation enabling “advanced management” of inference services (weighted traffic splitting, header-based routing, service mesh integration…)
  3. Allow the installation and operation of at least one gang-scheduling solution
  4. Handle vertical scaling of node groups containing specific accelerator types
  5. If present, ensure proper operation of the HorizontalPodAutoscaler for pods using accelerators
  6. Expose, for the supported accelerator types, granular metrics via a standardized endpoint, in machine-readable form (at minimum, per-accelerator utilization and memory occupancy)
  7. Support discovery and collection of workload metrics in a standard format
  8. Isolate containers’ access to accelerators
  9. Be able to install and reliably run at least one complex AI operator with a CRD
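To make the first two items concrete, here is a minimal sketch of a DRA claim and a Gateway API route with weighted and header-based routing. The device class, gateway and service names are hypothetical, and the exact DRA schema can vary between Kubernetes versions.

```yaml
# Sketch: a pod requesting one accelerator via DRA (resource.k8s.io/v1
# as of Kubernetes 1.34; the DeviceClass name is hypothetical).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com  # hypothetical DeviceClass
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
  - name: worker
    image: registry.example.com/inference:latest  # hypothetical image
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
---
# Sketch: weighted traffic split plus header-based routing with the
# Gateway API (gateway and backend service names are hypothetical).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-split
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - headers:
      - name: x-model-version
        value: canary
    backendRefs:
    - name: model-v2
      port: 8080
  - backendRefs:
    - name: model-v1
      port: 8080
      weight: 90
    - name: model-v2
      port: 8080
      weight: 10
```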

Each certification lasts one year and applies to a specific Kubernetes version. At present, either 1.33 (8 certified solutions) or 1.34 (11 certified solutions).

The 8 self-certified solutions on Kubernetes 1.33

First in alphabetical order, CoreWeave Kubernetes Services (CKS).
Among other notes in its compliance statement, the American “neo-cloud” (see our article on it) points out that it supports the SUNK scheduler (Slurm on Kubernetes). It also explains that access isolation is handled for now with device plug-ins, pending a move to DRA once vendor support matures.

DaoCloud Enterprise does not implement DRA (to be precise, the APIs are disabled by default on Kubernetes 1.33, so the spec does not require the feature there). Aimed at on-premises deployments, it also does not provide a vertical autoscaler.

For its Gardener platform, the NeoNephos Foundation (a Linux Foundation Europe project) provides evidence of Gateway API support via Traefik, and of gang scheduling via Kueue. Horizontal autoscaling is handled with a stack combining Prometheus and DCGM (NVIDIA Data Center GPU Manager). As the “complex AI operator,” KubeRay was chosen.
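Gang scheduling with Kueue generally comes down to pointing workloads at a queue backed by a quota. A minimal sketch, with hypothetical names and quota values:

```yaml
# Sketch: a Kueue setup admitting workloads only when their whole
# resource request fits the quota (names and quotas are hypothetical).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 32
      - name: memory
        nominalQuota: 128Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-queue
spec:
  clusterQueue: ml-cluster-queue
```

Jobs are then submitted with the label `kueue.x-k8s.io/queue-name: ml-queue`; Kueue keeps them suspended until the full request can be admitted at once.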

The German company Giant Swarm provides the platform of the same name. It added no comments to its self-certification, but the references to its documentation show that Kueue was selected to demonstrate compliance on gang scheduling, and KubeRay as the AI operator.
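For context, demonstrating the “complex AI operator” requirement with KubeRay amounts to installing its controller and applying one of its CRDs, such as RayCluster. A minimal sketch; the image tag and sizing are illustrative:

```yaml
# Sketch: a small Ray cluster managed by the KubeRay operator
# (assumes the operator is installed; image tag is illustrative).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: demo-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: workers
    replicas: 2
    minReplicas: 1
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
```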

Red Hat is also in the mix, with the latest OpenShift release (4.20). It too opted for Kueue. As an AI operator, the IBM subsidiary used Kubeflow Trainer, with several CRDs (TrainJob, TrainingRuntime, ClusterTrainingRuntime). It notes that, regarding accelerator metrics, dedicated operators are offered for AMD GPUs in addition to NVIDIA GPUs.

Read also: Knative, the serverless layer for Kubernetes, reaches maturity

SUSE self-certified RKE2 (the second iteration of Rancher Kubernetes Engine). Again, no additional commentary, but a pointer to a new section of its documentation devoted to compliance with the CNCF spec. It shows that Volcano was favored for gang scheduling, and that SUSE AI is highlighted for metrics collection.

Red Hat also self-certified a second product: ROSA (Red Hat OpenShift Service on AWS), in its latest version. With the same base as OpenShift, but with specific validations.

Talos Linux, an immutable OS for Kubernetes, was also certified, by its publisher Sidero Labs. It notes that no dedicated vertical autoscaler is provided and that the product does not ship observability tools out of the box.

The 11 self-certified solutions on Kubernetes 1.34

First alphabetically, ACK (Alibaba Cloud Container Service for Kubernetes). Its compliance was demonstrated using both Spark and Ray. On the metrics side, Alibaba leveraged its managed Prometheus service.

AKS (Azure Kubernetes Service) has also been self-certified. Microsoft used Istio, Kueue and DCGM, among others. For AI operators, it made a particular choice beyond Ray: KAITO (Kubernetes AI Toolchain Operator), a CNCF sandbox project based on vLLM.

Baidu has self-certified its CCE (Cloud Container Engine), with Volcano for gang scheduling, a managed Prometheus for horizontal autoscaling… and a deployment of SGLang for the AI operator.

Self-certified on Kubernetes 1.33, CoreWeave Kubernetes Service (CKS) is also certified on 1.34.

Read also: Kubernetes: Databricks’ choices for load balancing

Amazon largely relied on its own services to demonstrate the conformity of EKS (Elastic Kubernetes Service). Among others, its AWS Load Balancer Controller, its AWS Batch scheduler, its CloudWatch monitoring, and its Neuron Monitor metrics collector.

GKE (Google Kubernetes Engine) is also self-certified. As with Amazon, Google highlights its own services… and a tutorial aimed at building an ML platform combining Ray and Kubeflow.

KKP (Kubermatic Kubernetes Platform) has its own MLA stack (“Monitoring Logging & Alerting”), used in its self-certification. It also has its own gateway controller (KubeLB).

With LKE (Linode Kubernetes Engine), Akamai has its own vertical autoscaler. For the pods, it relies on the Prometheus adapter. Metrics collection related to accelerators goes through DCGM. Istio is used as the reference implementation of the Gateway API.
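A setup like Akamai’s (Prometheus adapter plus DCGM) enables GPU-aware horizontal autoscaling, the fifth mandatory item. A minimal sketch, assuming the DCGM exporter’s per-pod utilization metric is exposed through the Prometheus adapter; the target deployment name is hypothetical:

```yaml
# Sketch: an HPA scaling on GPU utilization surfaced by the
# DCGM exporter via prometheus-adapter (deployment name is hypothetical).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL  # GPU utilization metric from the DCGM exporter
      target:
        type: AverageValue
        averageValue: "70"
```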

Istio was also Oracle’s choice to demonstrate the conformity of OKE (OCI Kubernetes Engine). For workload metrics, the American group has its own project, OCI GPU Scanner, released under a free license (UPL) and installable via Terraform, via Helm, or as an add-on from the OCI console.

Self-certified on Kubernetes 1.33, Talos Linux is also certified on version 1.34.

The last alphabetically is VKS (VMware Kubernetes Service). VMware self-certified it by relying notably on Istio, Kueue, Prometheus, DCGM and KubeRay.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.